Abstract

Bayesian model averaging, model selection and their approximations such as BIC are generally
statistically consistent, but sometimes achieve slower rates of convergence than other methods such as
AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We
identify the catch-up phenomenon as a novel explanation for the slow convergence of Bayesian methods.
Based on this analysis we define the switch distribution, a modification of the Bayesian marginal
distribution. We show that, under broad conditions, model selection and prediction based on the switch
distribution are both consistent and achieve optimal convergence rates, thereby resolving the AIC-BIC
dilemma. The method is practical; we give an efficient implementation. The switch distribution has a
data compression interpretation, and can thus be viewed as a "prequential" or MDL method; yet it is
different from the MDL methods that are usually considered in the literature. We compare the switch
distribution to Bayes factor model selection and leave-one-out cross-validation.

1 Introduction: The Catch-Up Phenomenon 

We consider inference based on a countable set of models (sets of probability distributions), focusing on two 
tasks: model selection and model averaging. In model selection tasks, the goal is to select the model that 
best explains the given data. In model averaging, the goal is to find the weighted combination of models that 
leads to the best prediction of future data from the same source. 

An attractive property of some criteria for model selection is that they are consistent under weak
conditions, i.e. if the true distribution P* is in one of the models, then the P*-probability that this
model is selected goes to one as the sample size increases. BIC [Schwarz, 1978], Bayes factor model
selection [Kass and Raftery, 1995], Minimum Description Length (MDL) model selection [Barron et al.,
1998] and prequential model validation [Dawid, 1984] are examples of widely used model selection
criteria that are usually consistent. However, other model selection criteria such as AIC [Akaike, 1974]
and leave-one-out cross-validation (LOO) [Stone, 1977], while often inconsistent, do typically yield
better predictions. This is especially the case in nonparametric settings of the following type: P* can
be arbitrarily well-approximated by a sequence of distributions in the (parametric) models under
consideration, but is not itself contained in any of these. In many such cases, the predictive
distribution converges to the true distribution at the optimal rate for AIC and LOO [Shibata, 1983,
Li, 1987], whereas in general MDL, BIC, the Bayes factor method and prequential validation only achieve
the optimal rate to within an O(log n) factor [Rissanen et al., 1992, Foster and George, 1994, Yang,
1999, Grünwald, 2007]. In this paper we reconcile these seemingly conflicting approaches [Yang, 2005a]
by improving the rate of convergence achieved in Bayesian model selection without losing its
consistency properties. First we provide an example to show why Bayes sometimes converges too slowly.

*A preliminary version of a part of this paper appeared as [van Erven et al., 2007].



1.1 The Catch-Up Phenomenon 

Given priors on parametric models $\mathcal{M}_1, \mathcal{M}_2, \ldots$ and parameters therein,
Bayesian inference associates each model $\mathcal{M}_k$ with the marginal distribution $\bar{p}_k$,
given by

$$\bar{p}_k(x^n) = \int_{\theta \in \Theta_k} p_\theta(x^n)\, w(\theta)\, d\theta,$$

obtained by averaging over the parameters according to the prior. In Bayes factor model selection the
preferred model is the one with maximum a posteriori probability. By Bayes' rule this is
$\arg\max_k \bar{p}_k(x^n)\, w(k)$, where $w(k)$ denotes the prior probability of $\mathcal{M}_k$. We can
further average over model indices, a process called Bayesian Model Averaging (BMA). The resulting
distribution $p_{\mathrm{bma}}(x^n) = \sum_k \bar{p}_k(x^n)\, w(k)$ can be used for prediction. In a
sequential setting, the probability of a data sequence $x^n := x_1, \ldots, x_n$ under a distribution
$p$ typically decreases exponentially fast in $n$. It is therefore common to consider $-\log p(x^n)$,
which we call the code length of $x^n$ achieved by $p$. We take all logarithms to base 2, allowing us to
measure code length in bits. The name code length refers to the correspondence between code length
functions and probability distributions based on the Kraft inequality, but one may also think of the
code length as the accumulated log loss that is incurred if we sequentially predict the $x_i$ by
conditioning on the past, i.e. using $p(\cdot \mid x^{i-1})$ [Barron et al., 1998, Grünwald, 2007,
Dawid, 1984, Rissanen, 1984]. For BMA, we have



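$$-\log p_{\mathrm{bma}}(x^n) = -\log \prod_{i=1}^{n} p_{\mathrm{bma}}(x_i \mid x^{i-1})
  = \sum_{i=1}^{n} \bigl[-\log p_{\mathrm{bma}}(x_i \mid x^{i-1})\bigr].$$

Here the $i$th term represents the loss incurred when predicting $x_i$ given $x^{i-1}$ using
$p_{\mathrm{bma}}(\cdot \mid x^{i-1})$, which turns out to be equal to the posterior average:
$p_{\mathrm{bma}}(x_i \mid x^{i-1}) = \sum_k \bar{p}_k(x_i \mid x^{i-1})\, w(k \mid x^{i-1})$.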

Prediction using $p_{\mathrm{bma}}$ has the advantage that the code length it achieves on $x^n$ is close
to the code length of $\bar{p}_{\hat{k}}$, where $\hat{k}$ is the best of the marginals
$\bar{p}_1, \bar{p}_2, \ldots$, i.e. $\hat{k}$ achieves $\min_k -\log \bar{p}_k(x^n)$. More precisely,
given a prior $w$ on model indices, the difference between
$-\log p_{\mathrm{bma}}(x^n) = -\log\bigl(\sum_k \bar{p}_k(x^n)\, w(k)\bigr)$ and
$-\log \bar{p}_{\hat{k}}(x^n)$ must be in the range $[0, -\log w(\hat{k})]$, whatever data $x^n$ are
observed. Thus, using BMA for prediction is sensible if we are satisfied with doing essentially as well
as the best model under consideration. However, it is often possible to combine
$\bar{p}_1, \bar{p}_2, \ldots$ into a distribution that achieves smaller code length than
$\bar{p}_{\hat{k}}$! This is possible if the index $\hat{k}$ of the best distribution changes with the
sample size in a predictable way. This is common in model selection, for example with nested models,
say $\mathcal{M}_1 \subset \mathcal{M}_2$. In this case $\bar{p}_1$ typically predicts better at small
sample sizes (roughly, because $\mathcal{M}_2$ has more parameters that need to be learned than
$\mathcal{M}_1$), while $\bar{p}_2$ predicts better eventually. Figure 1 illustrates this phenomenon. It
shows the accumulated code length difference $-\log \bar{p}_2(x^n) - (-\log \bar{p}_1(x^n))$ on "The
Picture of Dorian Gray" by Oscar Wilde, where $\bar{p}_1$ and $\bar{p}_2$ are the Bayesian marginal
distributions for the first-order and second-order Markov chains, respectively, and each character in
the book is an outcome. We used uniform (Dirichlet(1, 1, ..., 1)) priors on the model parameters (i.e.,
the "transition probabilities"), but the same phenomenon occurs with other common priors, such as
Jeffreys'. Clearly $\bar{p}_1$ is better for about the first 100 000 outcomes, gaining a head start of
approximately 40 000 bits. Ideally we should predict the initial 100 000 outcomes using $\bar{p}_1$ and
the rest using $\bar{p}_2$. However, $p_{\mathrm{bma}}$ only starts to behave like $\bar{p}_2$ when it
catches up with $\bar{p}_1$ at a sample size of about 310 000, when the code length of $\bar{p}_2$ drops
below that of $\bar{p}_1$. Thus, in the shaded area $p_{\mathrm{bma}}$ behaves like $\bar{p}_1$ while
$\bar{p}_2$ is making better predictions of those outcomes: since at $n = 100\,000$, $\bar{p}_2$ is
40 000 bits behind, and at $n = 310\,000$, it has caught up, in between it must have outperformed
$\bar{p}_1$ by 40 000 bits!

Figure 1: The Catch-up Phenomenon
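The bound on the BMA code length stated above can be checked numerically. The following is a minimal sketch, not the experiment of Figure 1: as stand-ins for the Markov chain marginals it uses two hypothetical predictors on binary data, a fixed fair coin and the Laplace estimator, with prior weights of 1/2 each. All function names are our own.

```python
import math
import random

def cumulative_code_lengths(xs, predict_one):
    """Return [L_0, ..., L_n]: accumulated log loss in bits after each outcome."""
    total, ones, out = 0.0, 0, [0.0]
    for n, x in enumerate(xs):
        p1 = predict_one(ones, n)            # predicted probability that the next outcome is 1
        total += -math.log2(p1 if x == 1 else 1.0 - p1)
        ones += x
        out.append(total)
    return out

def fair(ones, n):
    """Hypothetical predictor 1: always predicts probability 1/2."""
    return 0.5

def laplace(ones, n):
    """Hypothetical predictor 2: Laplace estimator (ones + 1) / (n + 2)."""
    return (ones + 1) / (n + 2)

def bma_code_length(lengths, weights):
    """-log2 sum_k w(k) 2^(-L_k), computed relative to the smallest L_k for stability."""
    best = min(lengths)
    return best - math.log2(sum(w * 2.0 ** (best - L) for w, L in zip(weights, lengths)))

random.seed(0)
xs = [1 if random.random() < 0.8 else 0 for _ in range(2000)]
weights = [0.5, 0.5]
L1 = cumulative_code_lengths(xs, fair)
L2 = cumulative_code_lengths(xs, laplace)

# For every sample size, record (gap to the best marginal, upper bound -log2 w(k_hat)).
gaps = []
for a, b in zip(L1, L2):
    lengths = [a, b]
    k_hat = lengths.index(min(lengths))
    gaps.append((bma_code_length(lengths, weights) - lengths[k_hat],
                 -math.log2(weights[k_hat])))
```

With equal weights the gap never exceeds one bit at any sample size, which is exactly why BMA cannot do much better than the single best marginal either.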
Note that the example models $\mathcal{M}_1$ and $\mathcal{M}_2$ are very crude; for this particular
application much better models are available. Thus $\mathcal{M}_1$ and $\mathcal{M}_2$ serve as a simple
illustration only (see the discussion in Section [HTll). However, our theorems, as well as experiments
with nonparametric density estimation on which we will report elsewhere, indicate that the same
phenomenon also occurs with more realistic models. In fact, the general pattern that first one model is
better and then another occurs widely, both on real-world data and in theoretical settings. We argue
that failure to take this effect into account leads to the suboptimal rate of convergence achieved by
Bayes factor model selection and related methods. We have developed an alternative method to combine
distributions $\bar{p}_1$ and $\bar{p}_2$ into a single distribution $p_{\mathrm{sw}}$, which we call
the switch distribution, defined in Section 2. Figure 1 shows that $p_{\mathrm{sw}}$ behaves like
$\bar{p}_1$ initially, but in contrast to $p_{\mathrm{bma}}$ it starts to mimic $\bar{p}_2$ almost
immediately after $\bar{p}_2$ starts making better predictions; it essentially does this no matter what
sequence $x^n$ is actually observed. $p_{\mathrm{sw}}$ differs from $p_{\mathrm{bma}}$ in that it is
based on a prior distribution on sequences of models rather than simply a prior distribution on models.
This allows us to avoid the implicit assumption that there is one model which is best at all sample
sizes. After conditioning on past observations, the posterior we obtain gives a better indication of
which model performs best at the current sample size, thereby achieving a faster rate of convergence.
Indeed, the switch distribution is very closely related to earlier algorithms for tracking the best
expert developed in the universal prediction literature; see also Section iTl [Herbster and Warmuth,
1998, Vovk, 1999, Volf and Willems, 1998, Monteleoni and Jaakkola, 2004]; however, the applications we
have in mind and the theorems we prove are completely different.



1.2 Organization 

The remainder of the paper is organized as follows (for the reader's convenience, we have attached a
table of contents at the end of the paper). In Section 2 we introduce our basic concepts and notation,
and we then define the switch distribution. While in the example above we switched between just two
models, the general definition allows switching between elements of any finite or countably infinite
set of models. In Section 3 we show that model selection based on the switch distribution is consistent
(Theorem 1). Then in Section 4 we show that the switch distribution achieves a rate of convergence that
is never significantly worse than that of Bayesian model averaging, and we show that, in contrast to
Bayesian model averaging, the switch distribution achieves the worst-case optimal rate of convergence
when it is applied to histogram density estimation. In Section 5 we develop a number of tools that can
be used to bound the rate of convergence in Cesaro mean in more general parametric and nonparametric
settings, which include histogram density estimation as a special case. In Section 5.3 and Section 5.4
we apply these tools to show that the switch distribution achieves minimax convergence rates in density
estimation based on exponential families and in some nonparametric linear regression problems. In
Section 6 we give a practical algorithm that computes the switch distribution. TheoremO of that section
shows that the run-time for k predictors is Θ(n · k). In Section 7 and Section 8 we put our work in a
broader context and explain how our results fit into the existing literature. Specifically, Section 7.1
explains how our result can be reconciled with a seemingly contradictory recent result of [Yang, 2005a],
and Section [STT] describes a strange implication of the catch-up phenomenon for Bayes factor model
selection. The proofs of all theorems are in Appendix A (except the central results of Section 5, which
are proved in the main text).

2 The switch distribution for Model Selection and Prediction 

2.1 Preliminaries 

Suppose $X^\infty = (X_1, X_2, \ldots)$ is a sequence of random variables that take values in sample
space $\mathcal{X} \subseteq \mathbb{R}^d$ for some $d \in \mathbb{Z}^+ = \{1, 2, \ldots\}$. For
$n \in \mathbb{N} = \{0, 1, 2, \ldots\}$, let $x^n = (x_1, \ldots, x_n)$ denote the first $n$ outcomes
of $X^\infty$, such that $x^n$ takes values in the product space
$\mathcal{X}^n = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$. (We let $x^0$ denote the empty
sequence.) For $m > n$, we write $X_{n+1}^m$ for $(X_{n+1}, \ldots, X_m)$, where $m = \infty$ is
allowed. We omit the subscript when $n = 0$, writing $X^m$ rather than $X_1^m$.

Any distribution $P(X^\infty)$ may be defined in terms of a sequential prediction strategy $p$ that
predicts the next outcome at any time $n \in \mathbb{N}$. To be precise: Given the previous outcomes
$x^n$ at time $n$, a prediction strategy should issue a conditional density $p(X_{n+1} \mid x^n)$ with
corresponding distribution $P(X_{n+1} \mid x^n)$ for the next outcome $X_{n+1}$. Such sequential
prediction strategies are sometimes called prequential forecasting systems [Dawid, 1984]. An instance is
given in Example 1 below. Whenever the existence of a 'true' distribution $P^*$ is assumed (in other
words, $X^\infty$ is distributed according to $P^*$), we may think of any prediction strategy $p$ as a
procedure for estimating $P^*$, and in such cases, we will often refer to $p$ as an estimator. For
simplicity, we assume throughout that the density $p(X_{n+1} \mid x^n)$ is taken relative to either the
usual Lebesgue measure (if $\mathcal{X}$ is continuous) or the counting measure (if $\mathcal{X}$ is
countable). In the latter case $p(X_{n+1} \mid x^n)$ is a probability mass function. It is natural to
define the joint density $p(x_{n+1}^m \mid x^n) = p(x_{n+1} \mid x^n) \cdots p(x_m \mid x^{m-1})$ and
let $P(X_{n+1}^\infty \mid x^n)$ be the unique distribution on $\mathcal{X}^\infty$ such that, for all
$m > n$, $p(X_{n+1}^m \mid x^n)$ is the density of its marginal distribution for $X_{n+1}^m$. To ensure
that $P(X_{n+1}^\infty \mid x^n)$ is well-defined even if $\mathcal{X}$ is continuous, we will only
allow prediction strategies satisfying the natural requirement that for any $k \in \mathbb{Z}^+$ and
any fixed measurable event $A_{k+1} \subseteq \mathcal{X}_{k+1}$ the probability
$P(A_{k+1} \mid x^k)$ is a measurable function of $x^k$. This requirement holds automatically if
$\mathcal{X}$ is countable.

2.2 Model Selection and Prediction 

In model selection the goal is to choose an explanation for observed data $x^n$ from a potentially
infinite list of candidate models $\mathcal{M}_1, \mathcal{M}_2, \ldots$ We consider parametric models,
which we define as sets $\{p_\theta : \theta \in \Theta\}$ of prediction strategies $p_\theta$ that are
indexed by elements of $\Theta \subseteq \mathbb{R}^d$, for some smallest possible $d \in \mathbb{N}$,
the number of degrees of freedom. A model is more commonly viewed as a set of distributions, but since
distributions can be viewed as prediction strategies as explained above, we may think of a model as a
set of prediction strategies as well. Examples of model selection are histogram density estimation
[Rissanen et al., 1992] ($d$ is the number of bins minus 1), regression based on a set of basis
functions such as polynomials ($d$ is the number of coefficients of the polynomial), and the variable
selection problem in regression [Shibata, 1983, Li, 1987, Yang, 1999] ($d$ is the number of variables).
A model selection criterion is a function
$\delta : \bigcup_{n=0}^{\infty} \mathcal{X}^n \to \mathbb{Z}^+$ that, given any data sequence
$x^n \in \mathcal{X}^n$ of arbitrary length $n$, selects the model $\mathcal{M}_k$ with index
$k = \delta(x^n)$.

With each model $\mathcal{M}_k$ we associate a single prediction strategy $\bar{p}_k$. The bar
emphasizes that $\bar{p}_k$ is a meta-strategy based on the prediction strategies in $\mathcal{M}_k$. In
many approaches to model selection, for example AIC and LOO, $\bar{p}_k$ is defined using some parameter
estimator $\hat{\theta}_k$, which maps a sequence $x^n$ of previous observations to an estimated
parameter value that represents a "best guess" of the true/best distribution in the model. Prediction is
then based on this estimator:
$\bar{p}_k(X_{n+1} \mid x^n) = p_{\hat{\theta}_k(x^n)}(X_{n+1} \mid x^n)$, which also defines a joint
density $\bar{p}_k(x^n) = \bar{p}_k(x_1) \cdots \bar{p}_k(x_n \mid x^{n-1})$. The Bayesian approach to
model selection or model averaging goes the other way around. It starts out with a prior $w$ on
$\Theta_k$, and then defines the Bayesian marginal density

$$\bar{p}_k(x^n) = \int_{\theta \in \Theta_k} p_\theta(x^n)\, w(\theta)\, d\theta. \tag{1}$$

When $\bar{p}_k(x^n)$ is non-zero this joint density induces a unique conditional density

$$\bar{p}_k(X_{n+1} \mid x^n) = \frac{\bar{p}_k(X_{n+1}, x^n)}{\bar{p}_k(x^n)},$$

which is equal to the mixture of $p_\theta$ according to the posterior,
$w(\theta \mid x^n) = p_\theta(x^n)\, w(\theta) / \int p_\theta(x^n)\, w(\theta)\, d\theta$, based on
$x^n$. Thus the Bayesian approach also defines a prediction strategy $\bar{p}_k(X_{n+1} \mid x^n)$.

Associating a prediction strategy $\bar{p}_k$ with each model $\mathcal{M}_k$ is known as the
prequential approach to statistics [Dawid, 1984] or predictive MDL [Rissanen, 1984]. Regardless of
whether $\bar{p}_k$ is based on parameter estimation or on Bayesian predictions, we may usually think
of it as a universal code relative to $\mathcal{M}_k$ [Grünwald, 2007].



Example 1. Suppose $\mathcal{X} = \{0, 1\}$. Then a prediction strategy $\bar{p}$ may be based on the
Bernoulli model $\mathcal{M} = \{p_\theta \mid \theta \in [0, 1]\}$ that regards $X_1, X_2, \ldots$ as a
sequence of independent, identically distributed Bernoulli random variables with
$P_\theta(X_{n+1} = 1) = \theta$. We may predict $X_{n+1}$ using the maximum likelihood (ML) estimator
based on the past, i.e. using $\hat{\theta}(x^n) = n^{-1} \sum_{i=1}^{n} x_i$. The prediction for $x_1$
is then undefined. If we use a smoothed ML estimator such as the Laplace estimator,
$\hat{\theta}'(x^n) = (n + 2)^{-1} \bigl(\sum_{i=1}^{n} x_i + 1\bigr)$, then all predictions are
well-defined. It is well-known that the predictor $\bar{p}'$ defined by
$\bar{p}'(X_{n+1} \mid x^n) = p_{\hat{\theta}'(x^n)}(X_{n+1})$ equals the Bayesian predictive
distribution based on a uniform prior. Thus in this case a Bayesian predictor and an estimation-based
predictor coincide!
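The coincidence claimed in Example 1 can be verified exactly for short sequences. The sketch below (function names are our own) multiplies the sequential Laplace predictions and compares the result with the Bayes marginal under a uniform prior, for which the integral of $\theta^k (1-\theta)^{n-k}$ over $[0, 1]$ has the closed form $k!\,(n-k)!/(n+1)!$.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def laplace_joint(xs):
    """Joint probability of a binary sequence as a product of Laplace predictions."""
    prob, ones = Fraction(1), 0
    for n, x in enumerate(xs):
        p1 = Fraction(ones + 1, n + 2)       # Laplace estimate after n outcomes
        prob *= p1 if x == 1 else 1 - p1
        ones += x
    return prob

def uniform_bayes_marginal(xs):
    """Bayes marginal under a uniform prior: k! (n-k)! / (n+1)!."""
    n, k = len(xs), sum(xs)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

# Exhaustive check over all binary sequences of length 0 to 7, in exact arithmetic.
all_equal = all(laplace_joint(xs) == uniform_bayes_marginal(xs)
                for n in range(8) for xs in product([0, 1], repeat=n))
```

Exact rational arithmetic is used so that equality can be tested without a tolerance.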

In general, for a parametric model $\mathcal{M}_k$, we can define
$\bar{p}_k(X_{n+1} \mid x^n) = p_{\hat{\theta}'_k(x^n)}(X_{n+1})$ for some smoothed ML estimator
$\hat{\theta}'_k$. The joint distribution with density $\bar{p}_k(x^n)$ will then resemble, but in
general not be precisely equal to, the Bayes marginal distribution with density $\bar{p}_k(x^n)$ under
some prior on $\mathcal{M}_k$ [Grünwald, 2007, Chapter 9].



2.3 The switch distribution 

Suppose $\bar{p}_1, \bar{p}_2, \ldots$ is a list of prediction strategies for $X^\infty$. (Although here
the list is infinitely long, the developments below can with little modification be adjusted to the case
where the list is finite.) We first define a family $\mathcal{Q} = \{q_s : s \in \mathbb{S}\}$ of
combinator prediction strategies that switch between the original prediction strategies. Here the
parameter space $\mathbb{S}$ is defined as

$$\mathbb{S} = \{((t_1, k_1), \ldots, (t_m, k_m)) \in (\mathbb{N} \times \mathbb{Z}^+)^m \mid
  m \in \mathbb{Z}^+,\ 0 = t_1 < \ldots < t_m\}. \tag{2}$$

The parameter $s \in \mathbb{S}$ specifies the identities of $m$ constituent prediction strategies and
the sample sizes, called switch-points, at which to switch between them. For
$s = ((t'_1, k'_1), \ldots, (t'_{m'}, k'_{m'}))$, let $t_i(s) = t'_i$, $k_i(s) = k'_i$ and
$m(s) = m'$. We omit the argument when the parameter $s$ is clear from context; e.g. we write $t_3$ for
$t_3(s)$. For each $s \in \mathbb{S}$ the corresponding $q_s \in \mathcal{Q}$ is defined as:

$$q_s(X_{n+1} \mid x^n) =
\begin{cases}
\bar{p}_{k_1}(X_{n+1} \mid x^n) & \text{if } n < t_2, \\
\bar{p}_{k_2}(X_{n+1} \mid x^n) & \text{if } t_2 \le n < t_3, \\
\quad\vdots & \\
\bar{p}_{k_{m-1}}(X_{n+1} \mid x^n) & \text{if } t_{m-1} \le n < t_m, \\
\bar{p}_{k_m}(X_{n+1} \mid x^n) & \text{if } t_m \le n.
\end{cases} \tag{3}$$

Switching to the same predictor multiple times (consecutively or not) is allowed. The extra switch-point
$t_1$ is included to simplify notation; we always take $t_1 = 0$, so that $k_1$ represents the strategy
that is used in the beginning, before any actual switch takes place.
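The case distinction in (3) amounts to a lookup of the last switch-point that does not exceed the current sample size. A minimal sketch follows, assuming a parameter $s$ is represented as a Python list of `(t_i, k_i)` pairs and each strategy as a function from the past and a candidate outcome to a conditional probability; both representations, and the two example predictors, are our own illustrative choices.

```python
import math

def active_index(s, n):
    """Index k of the strategy that q_s uses to predict X_{n+1} after n outcomes."""
    assert s and s[0][0] == 0, "the first switch-point must be t_1 = 0"
    assert all(a[0] < b[0] for a, b in zip(s, s[1:])), "switch-points must increase"
    current = s[0][1]
    for t, k in s:
        if t <= n:
            current = k                      # switch-point t has been reached
        else:
            break
    return current

def q_s_code_length(s, xs, predictors):
    """Accumulated log loss in bits of the combinator strategy q_s on the sequence xs."""
    total = 0.0
    for n, x in enumerate(xs):
        k = active_index(s, n)
        total += -math.log2(predictors[k](xs[:n], x))
    return total

def fair(past, x):
    """Hypothetical strategy 1: fair coin."""
    return 0.5

def laplace(past, x):
    """Hypothetical strategy 2: Laplace estimator on binary outcomes."""
    p1 = (sum(past) + 1) / (len(past) + 2)
    return p1 if x == 1 else 1.0 - p1

s = [(0, 1), (3, 2)]                         # start with strategy 1, switch to 2 at n = 3
schedule = [active_index(s, n) for n in range(6)]
loss = q_s_code_length(s, [1, 1, 1, 1], {1: fair, 2: laplace})
```

Note that after a switch the new strategy still conditions on the entire past, including the outcomes that were predicted by the previous strategy.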

Given a list of prediction strategies $\bar{p}_1, \bar{p}_2, \ldots$, we define the switch distribution
as a Bayesian mixture of the elements of $\mathcal{Q}$ according to a prior $\pi$ on $\mathbb{S}$:

Definition 1 (switch distribution). Suppose $\pi$ is a probability mass function on $\mathbb{S}$. Then
the switch distribution $P_{\mathrm{sw}}$ with prior $\pi$ is the distribution for $(X^\infty, s)$ that
is defined by the density

$$p_{\mathrm{sw}}(x^n, s) = q_s(x^n) \cdot \pi(s) \tag{4}$$

for any $n \in \mathbb{Z}^+$, $x^n \in \mathcal{X}^n$, and $s \in \mathbb{S}$.

Hence the marginal likelihood of the switch distribution has density

$$p_{\mathrm{sw}}(x^n) = \sum_{s \in \mathbb{S}} q_s(x^n) \cdot \pi(s). \tag{5}$$

Although the switch distribution provides a general way to combine prediction strategies (see
Section 7.3), in this paper it will only be applied to combine prediction strategies
$\bar{p}_1, \bar{p}_2, \ldots$ that correspond to parametric models. In this case we may define a
corresponding model selection criterion $\delta_{\mathrm{sw}}$. To this end, let
$K_{n+1} : \mathbb{S} \to \mathbb{Z}^+$ be a random variable that denotes the strategy/model that is
used to predict $X_{n+1}$ given past observations $x^n$. Formally, let $i_0$ be the unique $i$ such that
$t_i(s) \le n$ and either $t_{i+1}(s) > n$ (i.e. the current sample size $n$ is between the $i$-th and
$(i+1)$-st switch-point), or $i = m(s)$ (i.e. the current sample size $n$ is beyond the last
switch-point). Then $K_{n+1}(s) = k_{i_0}(s)$. Now note that by Bayes' theorem, the prior $\pi$,
together with the data $x^n$, induces a posterior $\pi(s \mid x^n) \propto q_s(x^n)\, \pi(s)$ on
switching strategies $s$. This posterior on switching strategies further induces a posterior on the
model $K_{n+1}$ that is used to predict $X_{n+1}$. The algorithm given in Section 6 efficiently computes
this posterior distribution on $K_{n+1}$ given $x^n$:

$$\pi(K_{n+1} = k \mid x^n) = \frac{\sum_{\{s \,:\, K_{n+1}(s) = k\}} q_s(x^n)\, \pi(s)}
  {p_{\mathrm{sw}}(x^n)}, \tag{6}$$

which is defined whenever $p_{\mathrm{sw}}(x^n)$ is non-zero. We turn this posterior distribution into
the model selection criterion

$$\delta_{\mathrm{sw}}(x^n) = \arg\max_k\, \pi(K_{n+1} = k \mid x^n), \tag{7}$$

which selects the model with maximum posterior probability.
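To make (5) and (6) concrete, the following brute-force sketch evaluates them for a deliberately restricted toy setting. It is not the efficient algorithm of Section 6. We assume only two strategies and a hypothetical prior that puts mass 1/4 on always using strategy 1, mass 1/4 on always using strategy 2, and mass $\tfrac{1}{2} \cdot 1/(t(t+1))$ on a single switch from 1 to 2 at sample size $t \ge 1$. The inputs `L1` and `L2` are the accumulated code lengths in bits of the two strategies.

```python
import math

def switch_posterior(L1, L2, n):
    """Return (posterior probability that K_{n+1} = 2, code length -log2 p_sw(x^n)).

    Toy prior (an assumption of this sketch): 1/4 on 'always 1', 1/4 on 'always 2',
    and 1/2 * 1/(t(t+1)) on 'switch from 1 to 2 at sample size t', t = 1, 2, ...
    """
    # Each term: (code length of q_s on x^n, prior mass, strategy used for X_{n+1}).
    # Switches at t > n behave like 'always 1' on x^n; their total mass is 0.5 / (n + 1).
    terms = [(L1[n], 0.25 + 0.5 / (n + 1), 1),
             (L2[n], 0.25, 2)]
    for t in range(1, n + 1):
        # Strategy 1 predicts x_1..x_t, strategy 2 predicts x_{t+1}..x_n given all the past.
        terms.append((L1[t] + L2[n] - L2[t], 0.5 / (t * (t + 1)), 2))
    best = min(length for length, _, _ in terms)
    mass = {1: 0.0, 2: 0.0}
    for length, prior, k in terms:
        mass[k] += prior * 2.0 ** (best - length)    # scaled by 2^best for stability
    total = mass[1] + mass[2]
    return mass[2] / total, best - math.log2(total)

# Example: an all-ones sequence. The fair coin pays n bits, Laplace pays log2(n + 1) bits.
L1 = [float(n) for n in range(51)]
L2 = [math.log2(n + 1) for n in range(51)]
post0, len0 = switch_posterior(L1, L2, 0)
post50, len50 = switch_posterior(L1, L2, 50)
```

Under this toy prior the selected model would be 2 whenever the returned posterior exceeds 1/2. The enumeration costs time linear in $n$ per call only because at most one switch is allowed; the general parameter space $\mathbb{S}$ requires the dynamic-programming approach described later in the paper.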

3 Consistency 

If one of the models, say with index $k^*$, is actually true, then it is natural to ask whether
$\delta_{\mathrm{sw}}$ is consistent, in the sense that it asymptotically selects $k^*$ with
probability 1. Theorem 1 below states that, if the prediction strategies $\bar{p}_k$ associated with
the models are Bayesian predictive distributions, then $\delta_{\mathrm{sw}}$ is consistent under
certain conditions which are only slightly stronger than those required for standard Bayes factor model
selection consistency. It is followed by Theorem 2, which extends the result to the situation where the
$\bar{p}_k$ are not necessarily Bayesian.

Bayes factor model selection is consistent if for all $k, k' \ne k$, $\bar{P}_k(X^\infty)$ and
$\bar{P}_{k'}(X^\infty)$ are mutually singular, that is, if there exists a measurable set
$A \subseteq \mathcal{X}^\infty$ such that $\bar{P}_k(A) = 1$ and $\bar{P}_{k'}(A) = 0$ [Barron et al.,
1998]. For example, this can usually be shown to hold if (a) the models are nested and (b) for each
$k$, $\Theta_k$ is a subset of $\Theta_{k+1}$ of $w_{k+1}$-measure 0. In most interesting applications
in which (a) holds, (b) also holds [Grünwald, 2007]. For consistency of $\delta_{\mathrm{sw}}$, we need
to strengthen the mutual singularity-condition to a "conditional" mutual singularity-condition: we
require that, for all $k' \ne k$ and all $n$, all $x^n \in \mathcal{X}^n$, the distributions
$\bar{P}_k(X_{n+1}^\infty \mid x^n)$ and $\bar{P}_{k'}(X_{n+1}^\infty \mid x^n)$ are mutually singular.
For example, if $X_1, X_2, \ldots$ are independent and identically distributed (i.i.d.) according to
each $P_\theta$ in all models, but also if $\mathcal{X}$ is countable and
$\bar{p}_k(x_{n+1} \mid x^n) > 0$ for all $k$, all $x^{n+1} \in \mathcal{X}^{n+1}$, then this
conditional mutual singularity is automatically implied by ordinary mutual singularity of
$\bar{P}_k(X^\infty)$ and $\bar{P}_{k'}(X^\infty)$.

Let $E_s = \{s' \in \mathbb{S} \mid m(s') > m(s),\ (t_i(s'), k_i(s')) = (t_i(s), k_i(s))
\text{ for } i = 1, \ldots, m(s)\}$ denote the set of all possible extensions of $s$ to more
switch-points. Let $\bar{p}_1, \bar{p}_2, \ldots$ be Bayesian prediction strategies with respective
parameter spaces $\Theta_1, \Theta_2, \ldots$ and priors $w_1, w_2, \ldots$, and let $\pi$ be the prior
of the corresponding switch distribution.

Theorem 1 (Consistency of the switch distribution). Suppose $\pi$ is positive everywhere on
$\{s \in \mathbb{S} \mid m(s) = 1\}$ and such that for some positive constant $c$, for every
$s \in \mathbb{S}$, $c \cdot \pi(s) \ge \pi(E_s)$. Suppose further that
$\bar{P}_k(X_{n+1}^\infty \mid x^n)$ and $\bar{P}_{k'}(X_{n+1}^\infty \mid x^n)$ are mutually singular
for all $k, k' \in \mathbb{Z}^+$, $k \ne k'$, all $n$, all $x^n \in \mathcal{X}^n$. Then, for all
$k^* \in \mathbb{Z}^+$, for all $\theta^* \in \Theta_{k^*}$ except for a subset of $\Theta_{k^*}$ of
$w_{k^*}$-measure 0, the posterior distribution on $K_{n+1}$ satisfies

$$\pi(K_{n+1} = k^* \mid X^n) \xrightarrow{\,n \to \infty\,} 1
  \quad \text{with } P_{\theta^*}\text{-probability } 1. \tag{8}$$

The requirement that $c \cdot \pi(s) \ge \pi(E_s)$ is automatically satisfied if $\pi$ is of the form

$$\pi(s) = \pi_M(m)\, \pi_K(k_1) \prod_{i=2}^{m} \pi_T(t_i \mid t_i > t_{i-1})\, \pi_K(k_i), \tag{9}$$

where $\pi_M$, $\pi_K$ and $\pi_T$ are priors on $\mathbb{Z}^+$ with full support, and $\pi_M$ is
geometric: $\pi_M(m) = \theta^{m-1}(1 - \theta)$ for some $0 < \theta < 1$. In this case
$c = \theta/(1 - \theta)$.
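The value of the constant can be recovered by summing the factored prior over all extensions of $s$. The following is a sketch of that calculation in our own words, assuming the geometric form of $\pi_M$ with $0 < \theta < 1$. Fix $s$ with $m = m(s)$ and consider the extensions $s' \in E_s$ with exactly $m + j$ switch-points. Each additional factor $\pi_T(t_i \mid t_i > t_{i-1})\, \pi_K(k_i)$ sums to one over its possible values, so only the ratio of the $\pi_M$ terms remains:

$$\begin{aligned}
\sum_{s' \in E_s,\ m(s') = m + j} \pi(s') &= \frac{\pi_M(m + j)}{\pi_M(m)}\, \pi(s) = \theta^{j}\, \pi(s), \\
\pi(E_s) &= \sum_{j=1}^{\infty} \theta^{j}\, \pi(s) = \frac{\theta}{1 - \theta}\, \pi(s).
\end{aligned}$$

Hence $c \cdot \pi(s) \ge \pi(E_s)$ holds with equality for $c = \theta/(1 - \theta)$.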

We now extend the theorem to the case where the universal distributions $\bar{p}_1, \bar{p}_2, \ldots$
are not necessarily Bayesian, i.e. they are not necessarily of the form (1). It turns out that the
"meta-Bayesian" universal distribution $P_{\mathrm{sw}}$ is still consistent, as long as the following
condition holds. The condition essentially expresses that, for each $k$, $\bar{p}_k$ must not be too
different from a Bayesian predictive distribution based on (1). This can be verified if all models
$\mathcal{M}_k$ are exponential families, and the $\bar{p}_k$ represent ML or smoothed ML estimators
(see Theorems 2.1 and 2.2 of [Li and Yu, 2000]). We suspect that it holds as well for more general
parametric models and universal codes, but we do not know of any proof.

Condition There exist Bayesian prediction strategies $\bar{p}_1^B, \bar{p}_2^B, \ldots$ of form (1),
with continuous and strictly positive priors $w_1, w_2, \ldots$ such that

1. The conditions of Theorem 1 hold for $\bar{p}_1^B, \bar{p}_2^B, \ldots$ and the chosen switch
distribution prior $\pi$.

2. For all $k \in \mathbb{Z}^+$, for each compact subset $\Theta'$ of the interior of $\Theta_k$, there
exists a $K$ such that for all $\theta \in \Theta'$, with $P_\theta$-probability 1, for all $n$

$$-\log \bar{p}_k(X^n) + \log \bar{p}_k^B(X^n) \le K.$$

3. For all $k, k' \in \mathbb{Z}^+$ with $k \ne k'$ and all $x^n \in \mathcal{X}^*$, the distributions
$\bar{P}_k^B(X_{n+1}^\infty \mid x^n)$ and $\bar{P}_{k'}(X_{n+1}^\infty \mid x^n)$ are mutually
singular.

Theorem 2 (Consistency of the switch distribution, Part 2). Let $\bar{p}_1, \bar{p}_2, \ldots$ be
prediction strategies and let $\pi$ be the prior of the corresponding switch distribution. Suppose that
the condition above holds relative to $\bar{p}_1, \bar{p}_2, \ldots$ and $\pi$. Then, for all
$k^* \in \mathbb{Z}^+$, for all $\theta^* \in \Theta_{k^*}$ except for a subset of $\Theta_{k^*}$ of
Lebesgue-measure 0, the posterior distribution on $K_{n+1}$ satisfies

$$\pi(K_{n+1} = k^* \mid X^n) \xrightarrow{\,n \to \infty\,} 1
  \quad \text{with } P_{\theta^*}\text{-probability } 1. \tag{10}$$

4 Risk Convergence Rates 

In this section and the next we investigate how well the switch distribution is able to predict future
data in terms of expected logarithmic loss or, equivalently, how fast estimates based on the switch
distribution converge to the true distribution in terms of Kullback-Leibler risk. In Section 4.1 we
define the central notions of model classes, risk, convergence in Cesaro mean, and minimax convergence
rates, and we give the conditions on the prior distribution $\pi$ under which our further results hold.
We then (Section 4.2) show that the switch distribution cannot converge any slower than standard
Bayesian model averaging. As a proof of concept, in Section 4.3 we present Theorem HJ which establishes
that, in contrast to Bayesian model averaging, the switch distribution converges at the minimax optimal
rate in a nonparametric histogram density estimation setting.

In the more technical Section 5 we develop a number of general tools for establishing optimal
convergence rates for the switch distribution, and we show that optimal rates are achieved in, for
example, nonparametric density estimation with exponential families and (basic) nonparametric linear
regression, and also in standard parametric situations.



4.1 Preliminaries 



4.1.1 Model Classes

The setup is as follows. Suppose $\mathcal{M}_1, \mathcal{M}_2, \ldots$ is a sequence of parametric
models with associated estimators $\bar{P}_1, \bar{P}_2, \ldots$ as before. Let us write
$\mathcal{M} = \bigcup_{k=1}^{\infty} \mathcal{M}_k$ for the union of the models. Although formally
$\mathcal{M}$ is a set of prediction strategies, it will often be useful to consider the corresponding
set of distributions for $X^\infty = (X_1, X_2, \ldots)$. With minor abuse of notation we will denote
this set by $\mathcal{M}$ as well.

To test the predictions of the switch distribution, we will want to assume that $X^\infty$ is
distributed according to a distribution $P^*$ that satisfies certain restrictions. These restrictions
will always be formulated by assuming that $P^* \in \mathcal{M}^*$, where $\mathcal{M}^*$ is some
restricted set of distributions for $X^\infty$. For simplicity, we will also assume throughout that, for
any $n$, the conditional distribution $P^*(X_n \mid X^{n-1})$ has a density (relative to the Lebesgue or
counting measure) with probability one under $P^*$. For example, if $\mathcal{X} = [0, 1]$, then
$\mathcal{M}^*$ might be the set of all i.i.d. distributions that have uniformly bounded densities with
uniformly bounded first derivatives, as will be considered in Section 4.3. In general, however, the
sequence $X^\infty$ need not be i.i.d. (under the elements of $\mathcal{M}^*$).

We will refer to any set of distributions for $X^\infty$ as a model class. Thus both $\mathcal{M}$ and
$\mathcal{M}^*$ are model classes. In Section 5.3 it will be assumed that
$\mathcal{M}^* \subseteq \mathcal{M}$, which we will call the parametric setting. Most of our results,
however, deal with various nonparametric situations, in which $\mathcal{M}^* \setminus \mathcal{M}$ is
non-empty. It will then be useful to emphasize that $\mathcal{M}^*$ is (much) larger than $\mathcal{M}$
by calling $\mathcal{M}^*$ a nonparametric model class.



4.1.2 Risk 

Given $X^{n-1} = x^{n-1}$, we will measure how well any estimator $\bar{P}$ predicts $X_n$ in terms of
the Kullback-Leibler (KL) divergence
$D(P^*(X_n = \cdot \mid x^{n-1}) \,\|\, \bar{P}(X_n = \cdot \mid x^{n-1}))$ [Barron, 1998]. Suppose that
$P$ and $Q$ are distributions for some random variable $Y$, with densities $p$ and $q$ respectively.
Then the KL divergence from $P$ to $Q$ is defined as

$$D(P \| Q) = E_P\left[\log \frac{p(Y)}{q(Y)}\right].$$

KL divergence is never negative, and reaches zero if and only if $P$ equals $Q$. Taking an expectation
over $X^{n-1}$ leads to the standard definition of the risk of estimator $\bar{P}$ at sample size $n$
relative to KL divergence:

$$r_n(P^*, \bar{P}) = E_{X^{n-1} \sim P^*}\left[ D\bigl(P^*(X_n = \cdot \mid X^{n-1}) \,\|\,
  \bar{P}(X_n = \cdot \mid X^{n-1})\bigr) \right]. \tag{11}$$



Instead of the standard KL risk, we will study the cumulative risk

$$R_n(P^*, \bar{P}) := \sum_{i=1}^{n} r_i(P^*, \bar{P}), \tag{12}$$

because of its connection to information theoretic redundancy (see e.g. [Barron, 1998] or [Grünwald,
2007, Chapter 15]): For all $n$ it holds that

$$\sum_{i=1}^{n} r_i(P^*, \bar{P})
  = \sum_{i=1}^{n} E\left[\log \frac{p^*(X_i \mid X^{i-1})}{\bar{p}(X_i \mid X^{i-1})}\right]
  = E\left[\log \prod_{i=1}^{n} \frac{p^*(X_i \mid X^{i-1})}{\bar{p}(X_i \mid X^{i-1})}\right]
  = D\bigl(P^{*(n)} \,\|\, \bar{P}^{(n)}\bigr), \tag{13}$$

where the superscript $(n)$ denotes taking the marginal of the distribution on the first $n$ outcomes.
We will show convergence of the predictions of the switch distribution in terms of the cumulative rather
than the individual risk. This notion of convergence, defined below, is equivalent to the well-studied
notion of convergence in Cesaro mean. It has been considered by, among others, Rissanen et al. [1992],
Barron [1998], Poland and Hutter [2005], and its connections to ordinary convergence of the risk were
investigated in detail by Grünwald [2007].
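Identity (13) can be checked numerically in a small case. The sketch below assumes, purely for illustration, that $P^*$ is i.i.d. Bernoulli($\theta$) and that the estimator is the Laplace predictor of Example 1; it computes the sum of the per-outcome risks and the KL divergence between the joint distributions separately. Function names are our own.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D(Bernoulli(p) || Bernoulli(q)) in bits."""
    out = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0:
            out += a * math.log2(a / b)
    return out

def binom_pmf(n, k, theta):
    """Probability of k ones among n i.i.d. Bernoulli(theta) outcomes."""
    return math.comb(n, k) * theta**k * (1.0 - theta)**(n - k)

def cumulative_risk(theta, n):
    """Left-hand side of (13): sum over i of the expected KL divergence at sample size i."""
    total = 0.0
    for i in range(1, n + 1):
        for k in range(i):                   # k ones among the first i - 1 outcomes
            laplace = (k + 1) / (i + 1)      # Laplace prediction after i - 1 outcomes
            total += binom_pmf(i - 1, k, theta) * kl_bernoulli(theta, laplace)
    return total

def joint_kl(theta, n):
    """Right-hand side of (13): KL divergence between the joint distributions of x^n."""
    total = 0.0
    for k in range(n + 1):                   # group sequences by their number of ones
        log_true = k * math.log2(theta) + (n - k) * math.log2(1.0 - theta)
        log_est = math.log2(math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1))
        total += binom_pmf(n, k, theta) * (log_true - log_est)
    return total

lhs = cumulative_risk(0.3, 20)
rhs = joint_kl(0.3, 20)
```

The two quantities agree up to floating-point error, as the chain rule for KL divergence requires.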

Asymptotic properties like 'convergence' and 'convergence in Cesaro mean' will be expressed
conveniently using the following notation, which extends notation from [Yang and Barron, 1999]:

Definition 2 (Asymptotic Ordering of Functions). For any two nonnegative functions
$g, h : \mathbb{Z}^+ \to \mathbb{R}$ and any $c \ge 0$ we write $g \preceq_c h$ if for all
$\epsilon > 0$ there exists an $n_0$ such that for all $n \ge n_0$ it holds that
$g(n) \le (1 + \epsilon) \cdot c \cdot h(n)$. The less precise statement that there exists some $c > 0$
such that $g \preceq_c h$, will be denoted by $g \preceq h$. (Note the absence of the subscript.) For
$c > 0$, we define $h \succeq_c g$ to mean $g \preceq_{1/c} h$, and $h \succeq g$ means that for some
$c > 0$, $h \succeq_c g$. Finally, we say that $g \asymp h$ if both $g \preceq h$ and $h \preceq g$.

Note that $g \preceq h$ is equivalent to $g(n) = O(h(n))$. One may think of $g(n) \preceq_c h(n)$ as
another way of writing $\limsup_{n \to \infty} g(n)/h(n) \le c$. The two statements are equivalent if
$h(n)$ is never zero.

We can now succinctly state that the risk of an estimator $\bar{P}$ converges to 0 at rate $f(n)$ if
$r_n(P^*, \bar{P}) \preceq_1 f(n)$, where $f : \mathbb{Z}^+ \to \mathbb{R}$ is a nonnegative function
such that $f(n)$ goes to 0 as $n$ increases. We say that $\bar{P}$ converges to 0 at rate at least
$f(n)$ in Cesaro mean if
$\frac{1}{n} \sum_{i=1}^{n} r_i(P^*, \bar{P}) \preceq_1 \frac{1}{n} \sum_{i=1}^{n} f(i)$. As
$\preceq_1$-ordering is invariant under multiplication by a positive constant, convergence in Cesaro
mean is equivalent to asymptotically bounding the cumulative risk of $\bar{P}$ as

$$\sum_{i=1}^{n} r_i(P^*, \bar{P}) \preceq_1 \sum_{i=1}^{n} f(i). \tag{14}$$

We will always express convergence in Cesaro mean in terms of cumulative risks as in (14). The reader
may verify that if the risk of $\bar{P}$ is always finite and converges to 0 at rate $f(n)$ and
$\lim_{n \to \infty} \sum_{i=1}^{n} f(i) = \infty$, then the risk of $\bar{P}$ also converges in Cesaro
mean at rate $f(n)$. Conversely, suppose that the risk of $\bar{P}$ converges in Cesaro mean at rate
$f(n)$. Does this also imply that the risk of $\bar{P}$ converges to 0 at rate $f(n)$ in the ordinary
sense? The answer is "almost", as shown in [Grünwald, 2007]: The risk of $\bar{P}$ may be strictly
larger than $f(n)$ for some $n$, but the gap between any two $n$ and $n' > n$ at which the risk of
$\bar{P}$ exceeds $f$ must become infinitely large with increasing $n$. This indicates that, although
convergence in Cesaro mean is a weaker notion than standard convergence, obtaining fast Cesaro mean
convergence rates is still a worthy goal in prediction and estimation. We explore the connection between
Cesaro and ordinary convergence in more detail in Section [



4.1.3 Minimax Convergence Rates

The worst-case cumulative risk of the switch distribution is given by

$$G_{\text{sw}}(n) = \sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^n r_i(P^*, P_{\text{sw}}). \qquad (15)$$

We will compare it to the minimax cumulative risk, defined as:

$$G_{\text{mm-fix}}(n) := \inf_{\bar{P}} \sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^n r_i(P^*, \bar{P}), \qquad (16)$$

where the infimum is over all estimators $\bar{P}$ as defined in Section 2.1. We will say that the switch
distribution achieves the minimax convergence rate in Cesàro mean (up to a multiplicative constant) if
$G_{\text{sw}}(n) \preceq G_{\text{mm-fix}}(n)$. Note that there is no requirement that $\bar{P}(X_{i+1} \mid x^i)$ is a distribution
in $\mathcal{M}^*$ or $\mathcal{M}$; we are looking at the worst case over all possible estimators, irrespective of the
model class, $\mathcal{M}$, used to approximate $\mathcal{M}^*$. Thus, we may call $\bar{P}$ an "out-model estimator"
[Grünwald, 2007].

4.1.4 Restrictions on the Prior 

Throughout our analysis of the achieved rate of convergence we will require that the prior of the switch
distribution, $\pi$, can be factored as in ^, and is chosen to satisfy

$$-\log \pi_M(m) = O(m), \quad -\log \pi_K(k) = O(\log k), \quad -\log \pi_T(t) = O(\log t). \qquad (17)$$

Thus $\pi_M$, the prior on the total number of distinct predictors, is allowed to decrease either exponentially
(as required for Theorem [D) or polynomially, but $\pi_T$ and $\pi_K$ cannot decrease faster than polynomially.
For example, we could set $\pi_T(t) = 1/(t(t+1))$ and $\pi_K(k) = 1/(k(k+1))$, or we could take the universal
prior on the integers [Rissanen, 1983].
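
The example prior $1/(t(t+1))$ can be checked against the logarithmic requirement in (17) directly. The sketch below is only an illustration; the function name `pi_inv_quadratic` and the cut-off `N` are assumptions made here, not notation from the paper.

```python
import math

def pi_inv_quadratic(t):
    """Prior mass 1/(t(t+1)) assigned to the positive integer t."""
    return 1.0 / (t * (t + 1))

N = 10_000  # arbitrary cut-off for the numerical check

# Telescoping sum: sum_{t=1}^{N} 1/(t(t+1)) = N/(N+1), so the total mass tends to 1.
partial_mass = sum(pi_inv_quadratic(t) for t in range(1, N + 1))

# -log pi(t) = log t + log(t+1) <= 2 log(t+1): the code length grows only logarithmically.
worst_ratio = max(-math.log(pi_inv_quadratic(t)) / math.log(t + 1) for t in range(1, N + 1))
```

The bound on `worst_ratio` reflects that this prior decreases polynomially, which is the slowest kind of decay that (17) still permits for the switch-point and model-index priors.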



4.2 Never Much Worse than Bayes 

Suppose that the estimators $\bar{P}_1, \bar{P}_2, \ldots$ are Bayesian predictive distributions, defined by their
densities as in (dJ. The following lemma expresses that the Cesàro mean of the risk achieved by the switch
distribution is never much higher than that of Bayesian model averaging, which is itself never much higher
than that of any of the Bayesian estimators $\bar{P}_k$ under consideration.

Lemma 3. Let $P_{\text{sw}}$ be the switch distribution for $\bar{P}_1, \bar{P}_2, \ldots$ with prior $\pi$ of the form ^. Let
$P_{\text{bma}}$ be the Bayesian model averaging distribution for the same estimators, defined with respect to the
same prior on the estimators $\pi_K$. Then, for all $n \in \mathbb{Z}^+$, all $x^n \in \mathcal{X}^n$, and all $k \in \mathbb{Z}^+$,

$$p_{\text{sw}}(x^n) \ge \pi_M(1)\, p_{\text{bma}}(x^n) \ge \pi_M(1)\, \pi_K(k)\, \bar{p}_k(x^n). \qquad (18)$$

Consequently, if $X_1, X_2, \ldots$ are distributed according to any distribution $P^*$, then for any $k \in \mathbb{Z}^+$,

$$\sum_{i=1}^n r_i(P^*, P_{\text{sw}}) \le \sum_{i=1}^n r_i(P^*, P_{\text{bma}}) - \log \pi_M(1)
  \le \sum_{i=1}^n r_i(P^*, \bar{P}_k) - \log \pi_M(1) - \log \pi_K(k). \qquad (19)$$

As mentioned in the introduction, one advantage of model averaging using $p_{\text{bma}}$ is that it always
predicts almost as well as the estimator $\bar{p}_k$ for any $k$, including the $\bar{p}_k$ that yields the best
predictions overall. Lemma 3 shows that this property is shared by $p_{\text{sw}}$, which multiplicatively dominates
$p_{\text{bma}}$. In the sequel, we investigate under which circumstances the switch distribution may achieve a
smaller cumulative risk than Bayesian model averaging.
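
The second inequality in (18) says that the Bayesian mixture loses at most $-\log \pi_K(k)$ in log loss against any single estimator. The sketch below illustrates just that step on a toy problem; the three fixed-bias coin predictors, the renormalised prior and the data sequence are all hypothetical choices for illustration, and the switch distribution itself is not implemented here.

```python
import math

def log_bma(log_liks, log_prior):
    """log of sum_k prior(k) * p_k(x^n), computed stably with log-sum-exp."""
    terms = [lp + ll for lp, ll in zip(log_prior, log_liks)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Hypothetical setup: three fixed-bias coin predictors and a prior
# proportional to 1/(k(k+1)), renormalised over k = 1, 2, 3.
biases = [0.2, 0.5, 0.8]
raw = [1.0 / (k * (k + 1)) for k in (1, 2, 3)]
prior = [w / sum(raw) for w in raw]
x = [1, 1, 0, 1, 1, 1, 0, 1]  # made-up binary outcomes

log_liks = [sum(math.log(b if xi else 1.0 - b) for xi in x) for b in biases]
log_prior = [math.log(w) for w in prior]

loss_bma = -log_bma(log_liks, log_prior)  # cumulative log loss of the mixture
# Loss of each predictor plus the code length of its index under the prior.
bounds = [-ll - lp for ll, lp in zip(log_liks, log_prior)]
```

Because the mixture sums nonnegative terms, `loss_bma` can never exceed any entry of `bounds`, whatever the data.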

4.3 Histogram Density Estimation 

How many bins should be selected in density estimation based on histogram models with equal-width bins?
Suppose $X_1, X_2, \ldots$ take outcomes in $\mathcal{X} = [0, 1]$ and are distributed i.i.d. according to
$P^* \in \mathcal{M}^*$, where $P^*(X_n)$ has density $p^*$ for all $n$. Let $p^*(x^n) = \prod_{i=1}^n p^*(x_i)$ for
$x^n \in \mathcal{X}^n$. Let us restrict $\mathcal{M}^*$ to the set of distributions with densities that are uniformly
bounded above and below and also have uniformly bounded first derivatives. In particular, suppose there
exist constants $0 < c_0 < 1 < c_1$ and $c_2$ such that

$$\mathcal{M}^* = \{ P^* \mid c_0 \le p^*(x) \le c_1 \text{ and } |d/dx\; p^*(x)| \le c_2 \text{ for all } x \in [0, 1] \}. \qquad (20)$$

In this setting the minimax convergence rate in Cesàro mean can be achieved using histogram models
with bins of equal width (see below). The equal-width histogram model with $k$ bins, $\mathcal{M}_k$, is specified by
the set of densities $\{p_\theta\}$ on $\mathcal{X} = [0, 1]$ that are constant within the $k$ bins
$[0, a_1], (a_1, a_2], \ldots, (a_{k-1}, 1]$, where $a_i = i/k$. In other words, $\mathcal{M}_k$ contains any density $p_\theta$
such that, for all $x, x' \in [0, 1]$ that lie in the same bin, $p_\theta(x) = p_\theta(x')$. The $k$-dimensional
parameter vector $\theta = (\theta_1, \ldots, \theta_k)$ denotes the probability masses of the bins, which have to sum up
to one: $\sum_{i=1}^k \theta_i = 1$. Note that this last constraint makes the number of degrees of freedom one
less than the number of bins. Following [Yu and Speed, 1992] and [Rissanen et al., 1992], we associate the
following estimator $\bar{p}_k$ with model $\mathcal{M}_k$:

$$\bar{p}_k(X_{n+1} \mid x^n) := \frac{n_{X_{n+1}}(x^n) + 1}{n + k} \cdot k, \qquad (21)$$

where $n_{X_{n+1}}(x^n)$ denotes the number of outcomes in $x^n$ that fall into the same bin as $X_{n+1}$. As in
Example [U, these estimators may both be interpreted as being based on parameter estimation (estimating
$\hat{\theta}_i = (n_i(x^n) + 1)/(n + k)$, where $n_i(x^n)$ denotes the number of outcomes in bin $i$) or on Bayesian
prediction (a uniform prior for $\theta$ also leads to this estimator [Yu and Speed, 1992]).
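
To make the estimator in (21) concrete, here is a minimal sketch of the predictive density for equal-width bins on [0, 1]. The function name `histogram_density` and the sample data are assumptions for illustration; points exactly on a bin boundary are assigned to the bin on their left, matching the half-open bins above, although floating-point rounding may misplace such boundary points.

```python
import math

def histogram_density(x_new, data, k):
    """Predictive density (21): k * (count in x_new's bin + 1) / (n + k) on [0, 1]."""
    def bin_index(x):
        # Bins are [0, 1/k], (1/k, 2/k], ..., ((k-1)/k, 1]; the ceiling maps x to its bin.
        return max(1, math.ceil(x * k))
    target = bin_index(x_new)
    count = sum(1 for x in data if bin_index(x) == target)
    return k * (count + 1) / (len(data) + k)

# Made-up sample: three points in the first of four bins, one in the last.
data = [0.05, 0.15, 0.17, 0.9]
densities = [histogram_density((i + 0.5) / 4, data, 4) for i in range(4)]
# Each bin has width 1/4, so the density integrates to the mean of the four values.
total_mass = sum(densities) / 4
```

With no data the estimator reduces to the uniform density, since every bin then receives mass 1/k.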

The minimax convergence rate in Cesàro mean for $\mathcal{M}^*$ is of the order of $n^{-2/3}$
[Yu and Speed, 1992, Theorems 3.1 and 4.1]¹, which is equivalent to the statement that

$$G_{\text{mm-fix}}(n) \asymp n^{1/3}. \qquad (22)$$

This rate is achieved up to a multiplicative constant by the model selection criterion
$\delta(x^n) = \lceil n^{1/3} \rceil$, which, irrespective of the observed data, uses the histogram model with
$\lceil n^{1/3} \rceil$ bins to predict $X_{n+1}$ [Rissanen et al., 1992]:

$$\sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^n r_i(P^*, P_\delta) \preceq n^{1/3}. \qquad (23)$$
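
The data-independent criterion $\delta$ is simple to compute. The sketch below (the name `delta_bins` is an assumption made here) uses integer arithmetic rather than a floating-point cube root, because expressions such as `n ** (1/3)` can land just above or below an exact cube and make the ceiling off by one.

```python
def delta_bins(n):
    """Smallest integer k with k**3 >= n, i.e. the ceiling of the cube root of n."""
    k = 1
    while k ** 3 < n:
        k += 1
    return k

# Number of bins used at a few sample sizes.
bins_by_n = {n: delta_bins(n) for n in (1, 8, 9, 27, 28, 1000)}
```

The bin count depends on the sample size alone, which is why this criterion keeps adding bins even when the data are uniform.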

The optimal rate in Cesàro mean is also achieved (up to a multiplicative constant) by the switch distribution:

Theorem 4. Let $\bar{p}_1, \bar{p}_2, \ldots$ be histogram estimators as in (21), and let $p_{\text{sw}}$ denote the switch
distribution relative to these estimators with prior that satisfies the conditions in (17). Then

$$G_{\text{sw}}(n) = \sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^n r_i(P^*, P_{\text{sw}}) \preceq n^{1/3}. \qquad (24)$$

4.3.1 Comparison of the Switch Distribution to Other Estimators 

To return to the question of choosing the number of histogram bins, we will now first compare the switch
distribution to the minimax optimal model selection criterion $\delta$, which selects $\lceil n^{1/3} \rceil$ bins. We will
then also compare it to Bayes factors model selection and Bayesian model averaging.



¹ We note that [Yu and Speed, 1992] reproduces part of Theorem 1 from [Rissanen et al., 1992] without the
(necessary) condition that $c_0 < 1 < c_1$.

Although $\delta$ achieves the minimax convergence rate in Cesàro mean, it has two disadvantages compared
to the switch distribution. The first is that, in contrast to the switch distribution, $\delta$ is inconsistent. For
example, if $X_1, X_2, \ldots$ are i.i.d. according to the uniform distribution, then $\delta$ still selects
$\lceil n^{1/3} \rceil$ bins, while model selection based on $P_{\text{sw}}$ will correctly select the 1-bin histogram for all
large $n$. Experiments with simulated data confirm that $P_{\text{sw}}$ already prefers the 1-bin histogram at quite
small sample sizes. The other disadvantage is that if we are lucky enough to be in a scenario where $P^*$
actually allows a faster than the minimax convergence rate by letting the number of bins grow as $n^\gamma$ for
some $\gamma \ne 1/3$, the switch distribution would be able to take advantage of this whereas $\delta$ cannot. Our
experiments with simulated data confirm that, if $P^*$ has a sufficiently smooth density, then the switch
distribution predictively outperforms $\delta$ by a wide margin.

To achieve consistency one might also construct a Bayesian estimator based on a prior distribution on the
number of bins. However, [Yu and Speed, 1992, Theorem 2.4] suggests that Bayesian model averaging does
not achieve the same rate², but a rate of order $n^{-2/3}(\log n)^{2/3}$ instead, which is equivalent to the
statement that

$$\sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^n r_i(P^*, P_{\text{bma}}) \asymp n^{1/3}(\log n)^{2/3}. \qquad (25)$$

Bayesian model averaging will typically predict better than the single model selected by Bayes factor model
selection. We should therefore not expect Bayes factor model selection to achieve the minimax rate either.
While we have no formal proof that standard Bayesian model averaging behaves like (25), we have also
performed numerous empirical experiments which all confirm that Bayes performs significantly worse than
the switch distribution. We will report on these and the other aforementioned experiments elsewhere.

What causes this Bayesian inefficiency? Our explanation is that, as the sample size increases, the catch- 
up phenomenon occurs at each time that switching to a larger number of bins is required. Just like in the 
shaded region in Figure [H this causes Bayes to make suboptimal predictions for a while after each switch. 
This explanation is supported by the fact that the switch distribution, which has been designed with the 
catch-up phenomenon in mind, does not suffer from the same inefficiency, but achieves the minimax rate in 
Cesaro mean. 

5 Risk Convergence Rates, Advanced Results 

In this section we develop the theoretical results needed to prove minimax convergence results for the switch
distribution. First, in Section 5.1 we define the convenient concept of an oracle and show that the switch
distribution converges at least as fast as oracles that do not switch too often as the sample size increases.
In order to extend the oracle results to convergence rate results, it is useful to restrict ourselves to model
classes $\mathcal{M}^*$ of the "standard" type that is usually considered in the nonparametric literature. Essentially,
this amounts to imposing an independence assumption and the assumption that the convergence rate is of
order at least $n^{-\gamma}$ for some $\gamma < 1$. In Section 5.2 we define such standard nonparametric classes formally,
we explain in detail how their Cesàro convergence rate relates to their standard convergence rate, and we
provide our main lemma, which shows that, for standard nonparametric classes, $P_{\text{sw}}$ achieves the minimax
rate under a rather weak condition. In Sections 5.3 and 5.4 we apply this lemma to show that $P_{\text{sw}}$ achieves
the minimax rates in some concrete nonparametric settings: density estimation based on exponential families
and linear regression. Finally, Section 5.5 briefly considers the parametric case.



² In the left-hand side of (iii) in Theorem 2.4 of [Yu and Speed, 1992] the division by n is missing. (See its
proof on p. 203 of that paper.)

To get an intuitive idea of how the switch distribution avoids the catch-up phenomenon, it is essential to
look at the proofs of some of the results in this section, in particular Lemmas 5, 6, 8, 10 and 13. Therefore,
the proofs of these lemmas have been kept in the main text.

5.1 Oracle Convergence Rates 

Let $\mathcal{M}^*, \mathcal{M}_1, \mathcal{M}_2, \ldots$ and $\bar{P}_1, \bar{P}_2, \ldots$ be as in Section 4.1. As a technical tool, it will be
useful to compare the cumulative risk of the switch distribution to that of an oracle prediction strategy that
knows the true distribution $P^* \in \mathcal{M}^*$, but is restricted to switching between $\bar{P}_1, \bar{P}_2, \ldots$. Lemma 5
below gives an upper bound on the additional cumulative risk of the switch distribution compared to such an
oracle. To bound the rate of convergence in Cesàro mean for various nonparametric model classes we also
formulate Lemma 6, which is a direct consequence of Lemma 5. Lemma 6 will serve as a basis for further
rate of convergence results in Sections 5.2-5.4.

Definition 3 (Oracle). An oracle is a function $\omega : \mathcal{M}^* \times \bigcup_{n=0}^{\infty} \mathcal{X}^n \to \mathbb{Z}^+$ that, for all
$n \in \mathbb{N}$, given not only the observed data $x^n \in \mathcal{X}^n$, but also the true distribution $P^* \in \mathcal{M}^*$,
selects a model index, $\omega(P^*, x^n)$, with the purpose of predicting $X_{n+1}$ by
$\bar{P}_\omega(X_{n+1} \mid x^n) = \bar{P}_{\omega(P^*, x^n)}(X_{n+1} \mid x^n)$.

If $\omega(P^*, x^n) = \omega(P^*, y^n)$ for any $x^n, y^n \in \mathcal{X}^n$ (i.e. the oracle's choices do not depend on $x^n$,
but only on $n$), we will say that oracle $\omega$ does not look at the data and write $\omega(P^*, n)$ instead of
$\omega(P^*, x^n)$ for some arbitrary $x^n \in \mathcal{X}^n$.

Suppose $\omega$ is an oracle and $X_1, X_2, \ldots$ are distributed according to $P^* \in \mathcal{M}^*$. If
$X^{n-1} = x^{n-1}$, then $\omega(P^*, x^0), \ldots, \omega(P^*, x^{n-1})$ is the sequence of model indices chosen by $\omega$ to
predict $X_1, \ldots, X_n$. We may split this sequence into segments where the same model is chosen. Let us
define $m_\omega(n)$ as the maximum number of such distinct segments over all $P^* \in \mathcal{M}^*$ and all
$x^{n-1} \in \mathcal{X}^{n-1}$. That is, let

$$m_\omega(n) = \max_{P^* \in \mathcal{M}^*} \max_{x^{n-1} \in \mathcal{X}^{n-1}}
  |\{1 \le i \le n - 1 : \omega(P^*, x^{i-1}) \ne \omega(P^*, x^i)\}| + 1, \qquad (26)$$

where $x^i$ denotes the prefix of $x^{n-1}$ of length $i$. (The maximum always exists, because for any $P^*$ and
$x^{n-1}$ the number of segments is at most $n$.)
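
For a single sequence of model indices, the quantity inside the maxima of (26) is just the number of runs of equal consecutive choices. The following sketch counts those runs; the name `count_segments` and the example sequence are illustrative assumptions, and the maximisation over distributions and data sequences in (26) is not attempted here.

```python
def count_segments(indices):
    """Number of maximal runs of equal consecutive model indices in one sequence."""
    if not indices:
        return 0
    changes = sum(1 for prev, cur in zip(indices, indices[1:]) if prev != cur)
    return changes + 1

# Hypothetical oracle choices for predicting seven outcomes.
choices = [1, 1, 2, 2, 2, 3, 1]
n_segments = count_segments(choices)
```

Returning to an earlier model, as the last entry does, still opens a new segment.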

The following lemma expresses that any oracle $\omega$ that does not select overly complex models can
be approximated by the switch distribution with a maximum additional risk that depends on $m_\omega(n)$, its
maximum number of segments. We will typically be interested in oracles $\omega$ such that this maximum is small
in comparison to the sample size, $n$. The lemma is a tool in establishing the minimax convergence rates of
$P_{\text{sw}}$ that we consider in the following sections.

Lemma 5 (Oracle Approximation Lemma). Let $P_{\text{sw}}$ be the switch distribution, defined with respect to a
sequence of estimators $\bar{P}_1, \bar{P}_2, \ldots$ as introduced above, with any prior $\pi$ that satisfies the conditions
in (17) and let $P^* \in \mathcal{M}^*$. Suppose $g : \mathbb{Z}^+ \to \mathbb{R}$ is a positive, nondecreasing function and $\omega$ is an
oracle such that

$$\omega(P^*, x^{i-1}) \le g(i) \qquad (27)$$

for all $i \in \mathbb{Z}^+$, all $x^{i-1} \in \mathcal{X}^{i-1}$. Then

$$\sum_{i=1}^n r_i(P^*, P_{\text{sw}}) = \sum_{i=1}^n r_i(P^*, \bar{P}_\omega) + O\big(m_\omega(n) \cdot (\log n + \log g(n))\big), \qquad (28)$$

where the constants in the big-O notation depend only on the constants implicit in (17).

Proof. Using (13) we can rewrite (28) into the equivalent claim

$$E\left[\log \frac{\bar{p}_\omega(X^n)}{p_{\text{sw}}(X^n)}\right] = O\big(m_\omega(n) \cdot (\log n + \log g(n))\big), \qquad (29)$$

which we will proceed to prove. For all $n$, $x^n \in \mathcal{X}^n$, there exists an $s \in \mathbb{S}$ with $m(s) \le m_\omega(n)$ and
$t_{m(s)}(s) < n$ that selects the same sequence of models as $\omega$ to predict $x^n$, so that
$q_s(x_i \mid x^{i-1}) = \bar{p}_\omega(x_i \mid x^{i-1})$ for $1 \le i \le n$. Consequently, we can bound

$$p_{\text{sw}}(x^n) = \sum_{s' \in \mathbb{S}} q_{s'}(x^n) \cdot \pi(s') \ge q_s(x^n)\, \pi(s) = \bar{p}_\omega(x^n)\, \pi(s). \qquad (30)$$

By assumption (27) we have that $\omega$, and therefore $s$, never selects a model $\mathcal{M}_k$ with index $k$ larger than
$g(i)$ to predict the $i$th outcome. Together with (17) and the fact that $g$ is nondecreasing, this implies that

$$\begin{aligned}
-\log \pi(s) &= -\log \pi_M(m(s)) - \log \pi_K(k_1(s)) + \sum_{j=2}^{m(s)} \big[ -\log \pi_T(t_j(s) \mid t_{j-1}(s)) - \log \pi_K(k_j(s)) \big] \\
&= O(m(s)) + \sum_{j=1}^{m(s)} \big[ O(\log t_j(s)) + O(\log k_j(s)) \big] \\
&= O(m(s)) + \sum_{j=1}^{m(s)} \big[ O(\log t_j(s)) + O(\log g(t_j(s) + 1)) \big] \\
&= O(m(s)) + \sum_{j=1}^{m(s)} \big[ O(\log n) + O(\log g(n)) \big] = O\big(m_\omega(n) \cdot (\log n + \log g(n))\big), \qquad (31)
\end{aligned}$$

where the constants in the big-O in the final expression depend only on the constants in (17). Together (30)
and (31) imply (29), which was to be shown. □



From an information theoretic point of view, the additional risk of the switch distribution compared
to oracle $\omega$ may be interpreted as the number of bits required to encode how the oracle switches between
models.

In typical applications, we use oracles that achieve the minimax rate, and that are such that the number
of segments $m_\omega(n)$ is logarithmic in $n$, and $\omega$ never selects a model index larger than $n^\tau$ for some
$\tau > 0$ (typically, $\tau \le 1$ but some of our results allow larger $\tau$ as well). By Lemma 5, the additional risk
of the switch distribution over such an oracle is $O((\log n)^2)$. In nonparametric settings, the minimax rate
satisfies $G_{\text{mm-fix}}(n) \succeq n^{1-\gamma}$ for some $\gamma < 1$. This indicates that, for large $n$, the additional risk of
the switch distribution over a sporadically switching oracle becomes negligible. This is the basic idea that
underlies the nonparametric minimax convergence rate results of Sections 5.2-5.4. Rather than using Lemma 5
directly to prove such results, it is more convenient to use its straightforward extension Lemma 6 below,
which bounds the worst-case cumulative risk of the switch distribution in terms of the worst-case cumulative
risk of an oracle, $\omega$:

$$G_\omega(n) = \sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^n r_i(P^*, \bar{P}_\omega). \qquad (32)$$

Lemma 6 (Rate of Convergence Lemma). Let $P_{\text{sw}}$ be the switch distribution, defined with respect to a
sequence of estimators $\bar{P}_1, \bar{P}_2, \ldots$ as above, with any prior $\pi$ that satisfies the conditions in (17). Let
$f : \mathbb{Z}^+ \to \mathbb{R}$ be a nonnegative function and let $\mathcal{M}^*$ be a set of distributions on $\mathcal{X}^\infty$. Suppose
there exist a positive, nondecreasing function $g : \mathbb{Z}^+ \to \mathbb{R}$, an oracle $\omega$, and constants $c_1, c_2 \ge 0$
such that

(i) $\omega(P^*, x^{i-1}) \le g(i)$ (for all $i \in \mathbb{Z}^+$, $x^{i-1} \in \mathcal{X}^{i-1}$, and $P^* \in \mathcal{M}^*$),

(ii) $m_\omega(n)(\log n + \log g(n)) \preceq_{c_2} f(n)$,

(iii) $G_\omega(n) \preceq_{c_1} f(n)$.

Then there exists a constant $c_3 > 0$ such that $G_{\text{sw}}(n) \preceq_{c_1 + c_2 c_3} f(n)$.

Proof. By Lemma 5 we have that $G_{\text{sw}}(n) = G_\omega(n) + O(m_\omega(n) \cdot (\log n + \log g(n)))$. Therefore there
exists a constant $c_3 > 0$ such that

$$\limsup_{n \to \infty} \frac{G_{\text{sw}}(n)}{f(n)} \le \limsup_{n \to \infty} \frac{G_\omega(n) + c_3 \cdot m_\omega(n) \cdot (\log n + \log g(n))}{f(n)} \le c_1 + c_2 \cdot c_3,$$

where the second inequality follows from Conditions (ii) and (iii). □

Note that Condition (ii) is satisfied with $c_2 = 0$ iff $m_\omega(n)(\log n + \log g(n)) = o(f(n))$. In the following
subsections, we prove that $P_{\text{sw}}$ achieves $G_{\text{mm-fix}}(n)$ relative to various parametric and nonparametric
model classes $\mathcal{M}^*$ and $\mathcal{M}$. The proofs are invariably based on applying Lemma 6. Also, the proof of
Theorem 4 is based on Lemma 6. The general idea is to apply the lemma with $f(n)$ equal to the summed
minimax risk $G_{\text{mm-fix}}(n)$ (see (16)). If, for a given model class $\mathcal{M}^*$, one can exhibit an oracle $\omega$ that
only switches sporadically (Condition (ii) of the lemma) and that achieves $G_{\text{mm-fix}}(n)$ (Condition (iii)),
then the lemma implies that $P_{\text{sw}}$ achieves the minimax rate as well.
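
Condition (ii) with $c_2 = 0$ asks that the switching overhead be of smaller order than $f(n)$. As a rough numerical illustration, the sketch below compares an overhead of exactly $(\log n)^2$ with $f(n) = n^{1-\gamma}$ for $\gamma = 2/3$, the order that arises in the histogram setting. These exact forms, with all hidden constants set to one, are simplifying assumptions made here, and the function name `overhead_ratio` is hypothetical.

```python
import math

def overhead_ratio(n, gamma):
    """(log n)^2 divided by n^(1 - gamma): assumed overhead relative to the assumed rate."""
    return math.log(n) ** 2 / n ** (1.0 - gamma)

# Ratio at n = 10^3, 10^6, 10^9, 10^12 for gamma = 2/3.
ratios = [overhead_ratio(10 ** e, 2.0 / 3.0) for e in (3, 6, 9, 12)]
```

The ratio decreases steadily over this range, in line with the overhead becoming negligible for large $n$, although with these constants it only drops below one somewhere between $10^6$ and $10^9$.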

5.2 Standard Nonparametric Model Classes 

In this section we define "standard nonparametric model classes", and we present our main lemma, which
shows that, for such classes, $P_{\text{sw}}$ achieves the minimax rate under a rather weak condition. Standard
nonparametric classes are defined in terms of the (standard, non-Cesàro) minimax rate. Before we give a
precise definition of standard nonparametric, it is useful to compare the standard rate to the Cesàro rate.
For given $\mathcal{M}^*$, the standard minimax rate is defined as

$$g_{\text{mm}}(n) = \inf_{\bar{P}} \sup_{P^* \in \mathcal{M}^*} r_n(P^*, \bar{P}), \qquad (33)$$

where the infimum is over all possible estimators, as defined in Section 2.1; $\bar{P}$ is not required to lie in
$\mathcal{M}^*$ or $\mathcal{M}$. If an estimator achieves (33) to within a constant factor, we say that it converges at the
minimax optimal rate. Such an estimator will also achieve the minimax cumulative risk for varying $P^*$,
defined as

$$G_{\text{mm-var}}(n) = \sum_{i=1}^n g_{\text{mm}}(i) = \inf_{\bar{P}} \sum_{i=1}^n \sup_{P^* \in \mathcal{M}^*} r_i(P^*, \bar{P}), \qquad (34)$$

where the infimum is again over all possible estimators.

In many nonparametric density estimation and regression problems, the minimax risk $g_{\text{mm}}(n)$ is of order
$n^{-\gamma}$ for some $1/2 < \gamma < 1$ (see, for example, [Yang and Barron, 1998, 1999], [Barron and Sheu, 1991]), i.e.
$g_{\text{mm}}(n) \asymp n^{-\gamma}$, where $\gamma$ depends on the smoothness assumptions on the densities in $\mathcal{M}^*$. In this
case, we have

$$G_{\text{mm-var}}(n) \asymp \sum_{i=1}^n i^{-\gamma} \asymp \int_1^n x^{-\gamma}\, dx \asymp n^{1-\gamma}. \qquad (35)$$

Similarly, in standard parametric problems, the minimax risk $g_{\text{mm}}(n) \asymp 1/n$. In that case, analogously to
(35), we see that the minimax cumulative risk $G_{\text{mm-var}}$ is of order $\log n$.
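
The passage from a per-outcome rate to a cumulative rate in (35), and its parametric analogue, can be checked numerically. The sketch below is only an illustration; the value $\gamma = 2/3$, the sample size and the helper name `cumulative` are assumptions chosen here.

```python
import math

def cumulative(rate, n):
    """Sum of rate(i) for i = 1..n: the cumulative counterpart of a per-outcome rate."""
    return sum(rate(i) for i in range(1, n + 1))

n = 100_000
gamma = 2.0 / 3.0

# Nonparametric case: the sum of i^(-gamma) is of order n^(1 - gamma);
# the ratio approaches 1/(1 - gamma) = 3 from below.
nonparam_ratio = cumulative(lambda i: i ** -gamma, n) / n ** (1.0 - gamma)

# Parametric case: the sum of 1/i is of order log n; the ratio approaches 1 from above.
param_ratio = cumulative(lambda i: 1.0 / i, n) / math.log(n)
```

Both ratios stay bounded as $n$ grows, which is all that the $\asymp$ statements claim.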

Note, however, that our previous result for histograms (and, more generally, all results we are about to
present), is based on a scenario where $P^*$, while allowed to depend on $n$, is kept fixed over the terms in
the sum from 1 to $n$. Indeed, in Theorem 4 we showed that $P_{\text{sw}}$ achieves the minimax rate
$G_{\text{mm-fix}}(n)$ as defined in (16). Comparing to (34), we see that the supremum is moved outside of the sum.
Fortunately, $G_{\text{mm-fix}}$ and $G_{\text{mm-var}}$ are usually of the same order: in the parametric case, e.g.
$\mathcal{M}^* = \bigcup_{k \le k^*} \mathcal{M}_k$, both $G_{\text{mm-fix}}$ and $G_{\text{mm-var}}$ are of order $\log n$. For $G_{\text{mm-var}}$, we
have already seen this. For $G_{\text{mm-fix}}$, this is a standard information-theoretic result, see for example
[Clarke and Barron, 1990]. In a variety of standard nonparametric situations that are studied in the
literature, we have $G_{\text{mm-var}}(n) \asymp G_{\text{mm-fix}}(n)$ as well. Before showing this, we first define what we
mean by "standard nonparametric situations":

Definition 4 (Standard Nonparametric). We call a model class $\mathcal{M}^*$ standard nonparametric if

1. For any $P^* \in \mathcal{M}^*$, the random variables $X_1, X_2, \ldots$ are independent and identically distributed
whenever $X^\infty \sim P^*$, and $P^*(X_1)$ has a density (relative to the Lebesgue or counting measure); and

2. The minimax convergence rate, $g_{\text{mm}}(n)$, relative to $\mathcal{M}^*$ does not decrease too fast in the sense that,
for some $0 < \gamma < 1$, some nondecreasing function $h_0(n) = o(n^\gamma)$, it holds that

$$g_{\text{mm}}(n) \asymp n^{-\gamma} h_0(n). \qquad (36)$$

Examples of standard nonparametric $\mathcal{M}^*$ include cases with $g_{\text{mm}}(n) \asymp n^{-\gamma}$ (in that case
$h_0(n) \equiv 1$), or, more generally, $g_{\text{mm}}(n) \asymp n^{-\alpha}(\log n)^\beta$ for some $\alpha \in (0, 1)$, $\beta \in \mathbb{R}$ (take
$\gamma > \alpha$ and $h_0(n) = n^{\gamma - \alpha}(\log n)^\beta$; note that $\beta$ may be negative); see [Yang and Barron, 1999].
While in Lemma 6 there are neither independence nor convergence rate assumptions, in the next section we
develop extensions of Lemma 6 and Theorem 4 that do restrict attention to such "standard nonparametric"
model classes.

Proposition 7. For all standard nonparametric model classes, it holds that
$G_{\text{mm-fix}}(n) \asymp G_{\text{mm-var}}(n)$.

Summarizing, both in standard parametric and nonparametric cases, $G_{\text{mm-fix}}$ and $G_{\text{mm-var}}$ are of
comparable size. Therefore, Lemmas 5 and 6 do suggest that, both in standard parametric and nonparametric
cases, $P_{\text{sw}}$ achieves the minimax convergence rate $G_{\text{mm-fix}}$. In particular, this will hold if there exists
an oracle $\omega$ which achieves the minimax convergence rate, but which, at the same time, switches only
sporadically. However, the existence of such an oracle is often hard to show directly. Rather than applying
Lemma 6 directly, it is therefore often more convenient to use Lemma 8 below, whose proof is based on
Lemma 6. Lemma 8 gives a sufficient condition for achieving the minimax rate that is easy to establish for
several standard nonparametric model classes: If there exists an oracle $\omega$ that achieves the minimax rate,
such that all oracles $\omega'$ that lag a little behind $\omega$ achieve the minimax rate as well, then $P_{\text{sw}}$ must
achieve the minimax rate as well. Here "lags a little behind" means that the model chosen by $\omega'$ at sample
size $n$ was chosen by $\omega$ at a somewhat earlier sample size. Formally, we fix some constants $a > 1$ and
$c > 0$. Suppose that, for some oracles $\omega$ and $\omega'$, we have, for all $P^* \in \mathcal{M}^*$, $n \in \mathbb{Z}^+$ and
$x^{n-1} \in \mathcal{X}^{n-1}$,

$$\omega'(P^*, x^{n-1}) \in \{\omega(P^*, x^{i-1}) \mid i \in [n/a, n] \cap \mathbb{N}\},$$

where $x^{i-1}$ denotes the prefix of $x^{n-1}$ of length $i - 1$. In such a case we say that $\omega'$ lags behind
$\omega$ by at most a factor of $a$. Intuitively this means that, at each sample size $n$, $\omega'$ may choose any of
the models that was chosen by $\omega$ at sample size between $n/a$ and $n$. We call an oracle $\omega$ finite relative
to $\mathcal{M}^*$ if for all $n$, $\sup_{P^* \in \mathcal{M}^*} r_n(P^*, \bar{P}_\omega) < \infty$.

Lemma 8 (Standard Nonparametric Lemma). Suppose $\bar{P}_1, \bar{P}_2, \ldots$ are estimators and $P_{\text{sw}}$ is the
corresponding switch distribution with prior that satisfies (17). Let $\mathcal{M}^*$ be a standard nonparametric
model class. Let $\tau > 0$ be a constant, and let $\omega$ be an oracle such that $\omega(P^*, x^{n-1}) \le n^\tau$ for all
$P^* \in \mathcal{M}^*$, $n \in \mathbb{Z}^+$ and $x^{n-1} \in \mathcal{X}^{n-1}$. Suppose that any oracle $\omega'$ that lags behind $\omega$ by at
most a factor of $a > 1$, is finite relative to $\mathcal{M}^*$, and achieves the minimax convergence rate up to a
multiplicative constant $c > 0$:

$$\sup_{P^* \in \mathcal{M}^*} r_n(P^*, \bar{P}_{\omega'}) \preceq_c g_{\text{mm}}(n). \qquad (37)$$

Then the switch distribution achieves the minimax risk in Cesàro mean up to a multiplicative constant:

$$G_{\text{sw}}(n) \preceq_{c \cdot c'} G_{\text{mm-fix}}(n), \qquad (38)$$

where $c' = \limsup_{n \to \infty} G_{\text{mm-var}}(n) / G_{\text{mm-fix}}(n)$.

Proof. Let $t_j = \lceil a^{j-1} \rceil - 1$ for $j \in \mathbb{Z}^+$ be a sequence of switch-points that are exponentially far
apart, and define an oracle $\omega'$ as follows: For any $n \in \mathbb{Z}^+$, find $j$ such that $n \in [t_j + 1, t_{j+1}]$ and
let $\omega'(P^*, x^{n-1}) := \omega(P^*, x^{t_j})$ for any $P^* \in \mathcal{M}^*$ and any $x^{n-1} \in \mathcal{X}^{n-1}$. If we can apply
Lemma 6 for oracle $\omega'$, with $f(n) = G_{\text{mm-fix}}(n)$, $g(n) = n^\tau$, $c_1 = c \cdot c'$ and $c_2 = 0$, we will obtain
(38). It remains to show that in this case conditions (i)-(iii) of Lemma 6 are satisfied.

As to condition (i): $\omega'(P^*, x^{n-1}) = \omega(P^*, x^{t_j}) \le (t_j + 1)^\tau \le n^\tau$. Condition (ii) is also
satisfied, because $m_{\omega'}(n) \le \lceil \log_a n \rceil + 2$, which implies

$$m_{\omega'}(n)(\log n + \log g(n)) \le (\lceil \log_a n \rceil + 2)(\log n + \log n^\tau) = O((\log n)^2),$$

which is of smaller order than $n^{1-\gamma}$, and hence than $G_{\text{mm-fix}}(n)$, for some $\gamma < 1$, where we used that,
because $\mathcal{M}^*$ is standard nonparametric, both (35) and Proposition 7 hold. To verify condition (iii), first
note that by choice of the switch-points, $\omega'(P^*, x^{n-1}) = \omega(P^*, x^{t_j})$ with $t_j + 1 \in [n/a, n]$ and
therefore $\omega'$ satisfies (37) by assumption. Since $\omega'$ is finite relative to $\mathcal{M}^*$, this implies that
$G_{\omega'}(n) \preceq_c \sum_{i=1}^n g_{\text{mm}}(i) = G_{\text{mm-var}}(n)$ and hence that

$$G_{\omega'}(n) \preceq_c G_{\text{mm-fix}}(n) \cdot \frac{G_{\text{mm-var}}(n)}{G_{\text{mm-fix}}(n)} \preceq_{c c'} G_{\text{mm-fix}}(n). \qquad \Box$$
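
The construction of the lagging oracle in the proof can be sketched in a few lines for an oracle that does not look at the data. Everything specific below is an assumption made for illustration: the function name `lagging_oracle`, the choice $a = 2$, and a hypothetical oracle that moves to a new model at every step. The indexing convention is that `omega(i)` is the model used after seeing `i` outcomes.

```python
import math

def lagging_oracle(omega, a, n_max):
    """Choices of the lagging oracle for predicting outcomes 1..n_max, built from
    switch-points t_j = ceil(a**(j-1)) - 1: for n in [t_j + 1, t_{j+1}] it reuses
    the choice that omega makes after t_j outcomes."""
    choices = []
    j = 1
    for n in range(1, n_max + 1):
        while math.ceil(a ** j) - 1 < n:  # advance j until n <= t_{j+1}
            j += 1
        t_j = math.ceil(a ** (j - 1)) - 1
        choices.append(omega(t_j))
    return choices

# Hypothetical data-independent oracle that changes its choice at every step.
omega = lambda i: i + 1  # model index used after seeing i outcomes
a, n_max = 2, 1000
lagged = lagging_oracle(omega, a, n_max)
segments = 1 + sum(1 for prev, cur in zip(lagged, lagged[1:]) if prev != cur)
```

Although this `omega` switches 1000 times, the lagged version switches only at the exponentially spaced switch-points, so its number of segments is logarithmic in the sample size, and each of its choices was made by `omega` at a sample size between n/a and n.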

5.3 Example: Nonparametric Density Estimation with Exponential Families 

In many nonparametric situations, there exists an oracle $\omega$ that achieves the minimax convergence rate,
which only selects a model based on the sample size and not on the observed data. This holds, for example,
for density estimation based on sequences of exponential families as introduced by [Barron and Sheu, 1991],
[Sheu, 1990] under the assumption that the log density of the true distribution is in a Sobolev space. Not
surprisingly, using Lemma 8, we can show that $P_{\text{sw}}$ achieves the minimax rate in the Barron-Sheu setting.

Formally, let $\mathcal{X} = [0, 1]$, let $r > 1$ and let $W_2^r$ be the Sobolev space of functions $f$ on $\mathcal{X}$ for
which $f^{(r-1)}$ is absolutely continuous and $\int (f^{(r)}(x))^2\, dx$ is finite. Here $f^{(r)}$ denotes the $r$-th
derivative of $f$. Let $\mathcal{M}^{*(r)}$ be the model class such that for any $P^* \in \mathcal{M}^{*(r)}$ the random variables
$X_1, X_2, \ldots$ are i.i.d., and $P^*(X_i)$ has a density $p^*$ such that $\log p^* \in W_2^r$. We model
$\mathcal{M}^{*(r)}$ using sequences of exponential families $\mathcal{M}_1, \mathcal{M}_2, \ldots$ defined as follows. Let
$\mathcal{M}_k = \{p_\theta \mid \theta \in \mathbb{R}^k\}$ be the $k$-dimensional exponential family of densities on $[0, 1]$ with

$$p_\theta(x) = p_0(x) \exp\Big( \sum_{j=1}^k \theta_j \phi_j(x) - \psi_k(\theta) \Big),$$

where $\psi_k(\theta) = \log \int p_0(x) \exp\big(\sum_{j=1}^k \theta_j \phi_j(x)\big)\, dx$. Here $p_0$ is some reference density on
$[0, 1]$, taken with respect to Lebesgue measure. The density $p_\theta$ is extended to $\mathcal{X}^\infty$ by independence.
We let $\phi_1, \phi_2, \ldots$ be a countably infinite list of uniformly bounded, linearly independent functions, and
we define $S_k := \{1, \phi_1, \ldots, \phi_k\}$. We consider three possible choices for $S_k$: polynomials,
trigonometric series and splines of order $s$ with equally spaced knots. For example, we are allowed to
choose $1, x, x^2, \ldots$ in the polynomial case, or
$1, \cos(2\pi x), \sin(2\pi x), \ldots, \cos(2\pi(k/2)x), \sin(2\pi(k/2)x)$ in the trigonometric case. For precise
conditions on the $\phi_1, \ldots, \phi_k$ that are allowed in each case, we refer to [Barron and Sheu, 1991]. We equip
$\mathcal{M}_k$ with a Gaussian prior density $w_k(\theta)$, i.e. the parameters $\theta \in \mathbb{R}^k$ are independent Gaussian
random variables with mean 0 and a fixed variance $\sigma^2$. With each $\mathcal{M}_k$ we associate the Bayesian MAP
estimator $P_{\hat{\theta}_k(x^n)}$, where $\hat{\theta}_k(x^n) := \arg\max_{\theta \in \mathbb{R}^k} p_{k,\theta}(x^n)\, w_k(\theta)$. Define the
corresponding prediction strategy $\bar{P}_k$ by its density
$\bar{p}_k(x_{n+1} \mid x^n) := p_{\hat{\theta}_k(x^n)}(x_{n+1})$. Theorem 3.1 of [Sheu, 1990] (or rather its corollary on
page 50 of [Sheu, 1990]) states the following:



Theorem 9 (Barron and Sheu). Let $\phi_1, \phi_2, \ldots$ constitute a basis of polynomials, or trigonometric functions,
or splines of some order s, satisfying the conditions of nBarron and SheuL \l99l\] . Let r > Sin the polynomial 
case, r >2 in the trigonometric case, and r = s, s > 2 in the spline case. Let kin) be an arbitrary function 

such that $k(n) \asymp n^{1/(2r+1)}$. Then $\sup_{P^* \in \mathcal{M}^{*(r)}} r_n(P^*, \bar{P}_{k(n)}) < \infty$, and
$\sup_{P^* \in \mathcal{M}^{*(r)}} r_n(P^*, \bar{P}_{k(n)}) \asymp n^{-2r/(2r+1)}$.

The minimax convergence rate for the models $\mathcal{M}^{*(r)}$ is given by $g_{\text{mm}}(n) \asymp n^{-2r/(2r+1)}$
[Yang and Barron, 1998]. Thus, together with Lemma 8, using the oracle that at sample size $n$ selects a
model index of order $n^{1/(2r+1)}$, the theorem implies that $P_{\text{sw}}$ achieves the minimax convergence rate.
We note that the paper [Barron and Sheu, 1991] only establishes convergence of KL divergence in probability
when maximum likelihood parameters are used. For our purposes, we need convergence in expectation, which
holds when MAP parameters are used, as shown in Sheu's thesis [Sheu, 1990]. Since the prediction strategies
$\bar{P}_k$ are based on MAP estimators rather than on Bayes predictive distributions, our consistency result
Theorem 1 of Section 3 does not apply. However, by Theorem 2.1 and 2.2 of [Li and Yu, 2000], we can apply the
alternative consistency result Theorem 2. Thus, just as for histogram density estimation as discussed in
Section 4.3, we do have a proof of both consistency and minimax rate of convergence for general
nonparametric density estimation with exponential families.



5.4 Example: Nonparametric Linear Regression 

5.4.1 Lemma for Plug-In Estimators 

We first need a variation of Lemma 8 for the case that the $\bar{P}_k$ are plug-in strategies. We will then apply
the lemma to nonparametric linear regression with $\bar{P}_k$ based on maximum likelihood estimators within
$\mathcal{M}_k$. To prepare for this, it is useful to rename the observations to $Z_i$ rather than $X_i$.

As before, we assume that $Z_1, Z_2, \ldots$ are i.i.d. according to all $P^* \in \mathcal{M}^*$ and $P \in \mathcal{M}$. We write
$D(P^* \| P)$ for the KL divergence between $P^*$ and $P$ on a single outcome, i.e.
$D(P^* \| P) := D(P^*(Z_1 = \cdot) \| P(Z_1 = \cdot))$. For given $P^*$, let, if it exists, $\tilde{P}_k$ be the unique
$P \in \mathcal{M}_k$ achieving $\min_{P \in \mathcal{M}_k} D(P^* \| P)$.

Lemma 10. Let $\mathcal{M}^*$ be a standard nonparametric model class, and let $\bar{P}_1, \bar{P}_2, \ldots$ be plug-in
strategies, i.e. for all $k$, all $n$, $z^n \in \mathcal{X}^n$,
$\bar{P}_k(Z_{n+1} = \cdot \mid z^n) \in \{P(Z_1 = \cdot) \mid P \in \mathcal{M}_k\}$. Suppose that $\mathcal{M}_1, \mathcal{M}_2, \ldots$ are
such that

1. $\tilde{P}_1, \tilde{P}_2, \ldots$ all exist,

2. For all $n \ge 1$, $k \ge 1$, $\sup_{P^* \in \mathcal{M}^*} r_n(P^*, \bar{P}_k) < \infty$, and

3. There exists an oracle $\omega$ which achieves the minimax rate, i.e.
$\sup_{P^* \in \mathcal{M}^*} r_n(P^*, \bar{P}_\omega) \preceq g_{\text{mm}}(n)$, such that $\omega$ does not look at the data (in the sense of
Section 5.1) and $\omega(P^*, n - 1) \le n$ for any $P^* \in \mathcal{M}^*$ and $n \in \mathbb{Z}^+$.

4. Furthermore, define the estimation error $\text{ERR}_n(P^*, \bar{P}_k) := r_n(P^*, \bar{P}_k) - D(P^* \| \tilde{P}_k)$, and
suppose that for all $k \ge 1$, all $n > k$,

$$\text{ERR}_{n-1}(P^*, \bar{P}_k) \ge \text{ERR}_n(P^*, \bar{P}_k). \qquad (39)$$

Then $P_{\text{sw}}$ achieves the minimax rate in Cesàro mean, i.e. $G_{\text{sw}}(n) \preceq G_{\text{mm-fix}}(n)$.

Proof. For arbitrary $P^* \in \mathcal{M}^*$ and fixed $a > 1$, let $\omega'$ be any oracle that does not depend on the data
and that lags behind $\omega$ by at most a factor of $a$ in the sense of Lemma 8. For $n$ such that $n/a > 1$, let
$1 \le n' \le n$ be such that $\omega(P^*, n') = \omega'(P^*, n)$. Then

$$\begin{aligned}
r_n(P^*, \bar{P}_{\omega'}) &= D(P^* \| \tilde{P}_{\omega'}) + \text{ERR}_n(P^*, \bar{P}_{\omega'}) \\
&= D(P^* \| \tilde{P}_{\omega(P^*, n')}) + \text{ERR}_n(P^*, \bar{P}_{\omega(P^*, n')}) \\
&\le D(P^* \| \tilde{P}_{\omega(P^*, n')}) + \text{ERR}_{n'}(P^*, \bar{P}_{\omega(P^*, n')}) \\
&\preceq g_{\text{mm}}(n') \preceq (n/a)^{-\gamma} h_0(n) \preceq g_{\text{mm}}(n). \qquad (40)
\end{aligned}$$

Here $\text{ERR}_n(P^*, \bar{P}_{\omega(P^*, n')})$ denotes the estimation error when, at sample size $n$, the strategy
$\bar{P}_k$ with $k = \omega(P^*, n')$ is used. The last line follows because, by definition of standard
nonparametric, $h_0$ is nondecreasing. For $n$ such that $n/a > 1$, (40) in combination with condition 2 of the
lemma (for smaller $n$) shows that we can apply Lemma 8, and then the result follows. □

We call $\text{ERR}_n(P^*, \bar{P}_k)$ "estimation error" since it can be rewritten as the expected additional
logarithmic loss incurred when predicting $Z_{n+1}$ based on $\bar{P}_k$ rather than $\tilde{P}_k$, the best approximation
of $P^*$ within $\mathcal{M}_k$:

$$\text{ERR}_n(P^*, \bar{P}_k) = E_{Z^n \sim P^*} E_{Z_{n+1} \sim P^*} \big[ -\log \bar{P}_k(Z_{n+1} \mid Z^n) - (-\log \tilde{P}_k(Z_{n+1})) \big].$$

As can be seen in the proof of Lemma 11 below, in the linear regression case, $\text{ERR}_n(P^*, \bar{P}_k)$ can be
rewritten as the variance of the estimator $\bar{P}_k$, and thus coincides with the traditional definition of
estimation error.

In order to apply Lemma 10, one needs to find an oracle that does not look at the data. A good candidate
to check is the oracle

$$\omega^*(P^*, n) = \arg\min_k r_n(P^*, \bar P_k) \tag{41}$$

because, as is immediately verified, if there exists an oracle $\omega$ that does not look at the data and achieves the
minimax rate, then $\omega^*$ must achieve the minimax rate as well.
5.4.2 Nonparametric Linear Regression

We now apply Lemma 10 to linear regression with random i.i.d. design and i.i.d. normally distributed noise
with known variance $\sigma^2$, using least-squares or, equivalently, maximum likelihood estimators (see Section
6.2 of [Yang, 1999] and Section 4 of [Yang and Barron, 1998]). The results below show that $P_{sw}$ achieves the
minimax rate in nonparametric regression under a condition on the design distribution which we suspect to
hold quite generally, but which is hard to verify. Therefore, unfortunately, our result has formal implications
only for the restricted set of distributions for which the condition has been verified. We give examples of
such sets below.

Formally, we fix a sequence $\phi_1, \phi_2, \phi_3, \ldots$ of uniformly bounded, linearly independent functions from
$\mathcal{X}$ to $\mathbb{R}$. Let $S_k$ be the space of functions spanned by $\phi_1, \ldots, \phi_k$. The linear models $\mathcal{M}_k$ are families
of conditional distributions $P_\theta$ for $Y_i \in \mathbb{R}$ given $X_i \in \mathcal{X}$, where $\mathcal{X} = [0, 1]^d$ for some $d > 0$. Here
$\theta = (\theta_1, \ldots, \theta_k) \in \mathbb{R}^k$ and $P_\theta$ expresses that $Y_i = \sum_{j=1}^k \theta_j \phi_j(X_i) + U_i$, where the noise random
variables $U_1, U_2, \ldots$ are i.i.d. normally distributed with zero mean and fixed variance $\sigma^2$. The prediction
strategies $\bar P_1, \bar P_2, \ldots$ are based on maximum likelihood estimators. Thus, for $k \le n$,
$\bar P_k(y_{n+1} \mid x^{n+1}, y^n) := P_{\hat\theta(x^n, y^n)}(y_{n+1} \mid X_{n+1} = x_{n+1})$, where $\hat\theta(x^n, y^n) \in \mathbb{R}^k$ and $P_{\hat\theta(x^n, y^n)}$ is the ML
estimator within $\mathcal{M}_k$. For $k > n$, we may set $\bar P_k(y_{n+1} \mid x^{n+1}, y^n)$ to any fixed distribution $Q$ with
$\sup_{P^* \in \mathcal{M}^*} D(P^*(Y_{n+1} \mid X_{n+1}) \| Q(Y_{n+1} \mid X_{n+1})) < \infty$. We denote by $\Phi^{(n,k)}$ the $k \times n$ design matrix
with the $(j, i)$-th entry given by $\phi_j(x_i)$.

We fix a set of candidate design distributions $\mathcal{P}_X$ and a set of candidate regression functions $\mathcal{F}^*$, and we
let $\mathcal{M}^*$ denote the set of distributions on $(X_1, Y_1), (X_2, Y_2), \ldots$ such that the $X_i$ are i.i.d. according to some
$P_X \in \mathcal{P}_X$, and $Y_i = f^*(X_i) + U_i$ for some $f^* \in \mathcal{F}^*$, where $U_1, U_2, \ldots$ are i.i.d. normally distributed with zero
mean and variance $\sigma^2$. We assume that all $f^* \in \mathcal{F}^*$ can be expressed as

$$f^* = \sum_{j=1}^{\infty} \theta_j \phi_j \tag{42}$$

for some $\theta_1, \theta_2, \ldots$ with $\lim_{j \to \infty} \theta_j = 0$. It is immediate that for such combinations of $\mathcal{M}^*$ and $\mathcal{M}$,
conditions 1 and 2 of Lemma 10 hold. The following lemma shows that condition 4 also holds, and thus, if
we can also verify that condition 3 holds, then $P_{sw}$ achieves the minimax rate.

Lemma 11. Suppose that $\mathcal{M}_1, \mathcal{M}_2, \ldots$ are as above. Let $\mathcal{M}^*$ be as above, such that additionally, for all
$P^* \in \mathcal{M}^*$, all $n$, all $k \in \{1, \ldots, n\}$, the Fisher information matrix $(\Phi^{(n,k)})^T(\Phi^{(n,k)})$ is almost surely
nonsingular. Then (39) holds.

A sufficient condition for the required nonsingularity of $(\Phi^{(n,k)})^T(\Phi^{(n,k)})$ is, for example, that for all
$P^* \in \mathcal{M}^*$, the marginal distribution of $X$ under $P^*$ has a density under Lebesgue measure. If the conditions
of Lemma 11 hold and, additionally, we can show that some oracle achieves the minimax rate, then condition
3 of Lemma 10 is verified and $P_{sw}$ achieves the minimax rate as well. To verify whether this is the case,
note that

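Proposition 12. Suppose that (a) for some $\alpha > 0$, $\sup_{P^* \in \mathcal{M}^*} D(P^* \| \tilde P_k) \asymp' k^{-2\alpha}$, (b) $g_{mm}(n) \asymp' n^{-2\alpha/(2\alpha+1)}$,
and (c) for some $\tau$ with $1/(2\alpha + 1) < \tau < 1$, we have $\mathrm{ERR}_n(P^*, \bar P_k) \asymp' k/n$, uniformly for $k \in \{1, \ldots, n^\tau\}$.
Then letting, for all $P^* \in \mathcal{M}^*$, $\omega(P^*, n) := \lceil n^{1/(2\alpha+1)} \rceil$, we have $\sup_{P^* \in \mathcal{M}^*} r_n(P^*, \bar P_\omega) \asymp' g_{mm}(n)$.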

Here $a(n) \preceq' b(n)$ means "$a(n) \preceq b(n)$ and for all $n$, $a(n)$ is finite"; $\asymp'$ is defined in the same
way. We omit the straightforward proof of Proposition 12. Conditions (a) and (b) hold for many natural
combinations of $\mathcal{M}^*$ and $\mathcal{M}$, under quite weak conditions on $\mathcal{P}_X$ [Stone, 1982]. Possible $\mathcal{M}^*$ include
regression functions $f^*$ taken from Besov spaces and Sobolev spaces, and more generally cases where the
$\phi_j$ are "full approximation sets of functions" (which can be, e.g., polynomials or trigonometric functions)
[Yang and Barron, 1998, Section 4]. [Cox, 1988] shows that (c) also holds under some conditions, but these
are relatively strong; e.g. it holds if $P_X$ is a beta distribution and $\alpha = 1$. We suspect that (c) holds in much
more generality, but we have found no theorem that actually states this. Note that (c) in fact does hold,
even with $\asymp$ replaced by $=$, if, after having observed $x^n, y^n$, we evaluate $\bar P_k$ on a new $X_{n+1}$-value which is
chosen uniformly at random from $x_1, \ldots, x_n$ [Yang, 1999]. But this is of no use to us, since all our proofs
are ultimately based on the connection (13) between the cumulative risk and the KL divergence. While this
connection does not require data to be i.i.d., it does break down if we evaluate $\bar P_k$ on an $X_{n+1}$-value that
is not equal to the value of $X_{n+1}$ that will actually be observed in the case that additional data are sampled
from $P^*$. Therefore, we cannot extend our results to deal with this alternative evaluation for which (c)
holds automatically. All in all, we can show that the switch distribution achieves the minimax rate in certain
special cases (e.g. when the conditions of [Cox, 1988] hold for $\mathcal{P}_X$), but we conjecture that it holds in much
more generality.
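
As a rough numerical illustration of the trade-off behind Proposition 12, the following sketch balances an approximation term $k^{-2\alpha}$ against an estimation term $k/n$. It assumes that conditions (a) and (c) hold with all hidden constants equal to 1, which the proposition does not claim, and it rounds the oracle index up; the names `oracle_index`, `risk_proxy` and `best_proxy` are ours. Under these assumptions, the data-independent choice of $k$ of order $n^{1/(2\alpha+1)}$ yields a risk proxy within a small constant factor of the best value over all $k$.

```python
import math


def oracle_index(n: int, alpha: float) -> int:
    """Data-independent model index n^(1/(2*alpha+1)), rounded up."""
    return math.ceil(n ** (1.0 / (2.0 * alpha + 1.0)))


def risk_proxy(n: int, k: int, alpha: float) -> float:
    """Approximation error k^(-2*alpha) plus estimation error k/n.

    Both terms use constant 1; this is an assumption made for illustration.
    """
    return k ** (-2.0 * alpha) + k / n


def best_proxy(n: int, alpha: float) -> float:
    """Smallest risk proxy over all model indices 1..n (brute force)."""
    return min(risk_proxy(n, k, alpha) for k in range(1, n + 1))


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0):
        for n in (100, 10_000):
            k = oracle_index(n, alpha)
            ratio = risk_proxy(n, k, alpha) / best_proxy(n, alpha)
            rate = n ** (-2.0 * alpha / (2.0 * alpha + 1.0))
            print(f"alpha={alpha} n={n} k={k} ratio-to-best={ratio:.3f} "
                  f"proxy/rate={risk_proxy(n, k, alpha) / rate:.3f}")
```

The printed `proxy/rate` column stays bounded as $n$ grows, which is the numerical counterpart of the statement that the oracle attains the rate $n^{-2\alpha/(2\alpha+1)}$ in this idealised setting.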



5.5 The Parametric Case 

We end our treatment of convergence rates by considering the parametric case. Thus, in this subsection we
assume that $P^* \in \mathcal{M}_{k^*}$ for some $k^* \in \mathbb{Z}^+$, but we also consider that if $\mathcal{M}_1, \mathcal{M}_2, \ldots$ are of increasing
complexity, then the catch-up phenomenon may occur, meaning that at small sample sizes, some estimator $\bar P_k$
with $k < k^*$ may achieve smaller risk than $\bar P_{k^*}$. In particular, this can happen if $P^* \in \mathcal{M}_{k^*}$, $P^* \notin \mathcal{M}_{k^*-1}$,
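but $D(P^* \| \mathcal{M}_{k^*-1}) := \inf_{P \in \mathcal{M}_{k^*-1}} D(P^* \| P)$ is small. [Van Erven, 2006] shows that in some scenarios, there exist
sequences $P^*_{(1)}, P^*_{(2)}, \ldots$ of i.i.d. distributions with $P^*_{(m)} \in \mathcal{M}_{k^*}$ for all $m \in \mathbb{Z}^+$, such that
$\lim_{m \to \infty} D(P^*_{(m)} \| \mathcal{M}_{k^*-1}) = 0$ and

$$\lim_{m \to \infty} \lim_{n \to \infty} \bigl[ R_n(P^*_{(m)}, P_{bma}) - R_n(P^*_{(m)}, P_{sw}) \bigr] = \infty.$$

That is, the difference in cumulative risk between $P_{sw}$ and $P_{bma}$ may become arbitrarily large if
$D(P^*_{(m)} \| \mathcal{M}_{k^*-1})$ is chosen small enough. Thus,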
even in the parametric case $P_{bma}$ is not always optimal: if $P^* \in \mathcal{M}_{k^*}$, then, as soon as we also put a positive
prior weight on $\bar P_{k^*-1}$, $P_{bma}$ may favour $k^* - 1$ at sample sizes at which $\bar P_{k^*}$ has already become the best
predictor. The following lemma shows that in such cases the switch distribution remains optimal: the
predictive performance of the switch distribution is never much worse than the predictive performance of the
best oracle that iterates through the models in order of increasing complexity. In order to extend this result
to a formal proof that $P_{sw}$ always achieves the minimax convergence rate, we would have to additionally
show that there exist oracles of this kind that achieve the minimax convergence rate. Although we have no
formal proof of this extension, it seems likely that this is the case.

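Lemma 13. Let $P_{sw}$ be the switch distribution, defined with respect to a sequence of estimators $\bar P_1, \bar P_2, \ldots$
as above, with prior $\pi$ satisfying (17). Let $k^* \in \mathbb{Z}^+$, and let $\omega$ be any oracle such that for any $P^* \in \mathcal{M}^*$,
any $x^\infty \in \mathcal{X}^\infty$, the sequence $\omega_1, \omega_2, \ldots$ is nondecreasing and there exists some $n_0$ such that $\omega_n = k^*$ for
all $n \ge n_0$, where $\omega_i = \omega(P^*, x^{i-1})$ for all $i$. Then

$$G_{sw}(n) - G_\omega(n) \le \sup_{P^* \in \mathcal{M}^*} \Bigl( \sum_{i=1}^n r_i(P^*, P_{sw}) - \sum_{j=1}^n r_j(P^*, P_\omega) \Bigr) = k^* \cdot O(\log n). \tag{43}$$

Consequently, if $G_\omega(n) \succeq \log n$, then

$$G_{sw}(n) \preceq G_\omega(n). \tag{44}$$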
Proof. The inequality in (43) is a consequence of the general fact that $\sup_x f(x) - \sup_x f'(x) \le \sup_x (f(x) -
f'(x))$ for any two functions $f$ and $f'$. The second part of (43) follows by Lemma 5, applied with $g(n) = k^*$,
together with the observation that $m_\omega(n) \le k^*$. To show (44) we can apply Lemma 6 with $g(n) = k^*$ and
$f(n) = G_\omega(n)$. (Condition (i) of the lemma is satisfied with $c_1 = 1$, and by the assumption about $G_\omega(n)$ there
exists a constant $c_2$ such that Condition (ii) of the lemma is satisfied.) □

The lemma shows that the additional cumulative risk of the switch distribution compared to $P_\omega$ is of
order $\log n$. In the parametric case, we usually have $G_{\text{mm-fix}}(n)$ proportional to $\log n$ (Section 5). If that
is the case, and if, as seems reasonable, there is an oracle $\omega$ that satisfies the given restrictions and that
achieves summed risk proportional to $G_{\text{mm-fix}}(n)$, then the switch distribution also achieves a summed risk
that is proportional to $G_{\text{mm-fix}}(n)$.

6 Efficient Computation of the switch distribution

For priors $\pi$ as in (??), the posterior probability on predictors $p_1, p_2, \ldots$ can be efficiently computed
sequentially, provided that $\pi_T(Z = n \mid Z \ge n)$ and $\pi_K$ can be calculated quickly (say in constant time) and that
$\pi_M(m) = \theta^m(1 - \theta)$ is geometric with parameter $\theta$, as is also required for Theorem 1 and (see Section 4.1.4)
permitted in the theorems and lemmas of Sections 4 and 5. For example, we may take $\pi_K(k) = 1/(k(k + 1))$
and $\pi_T(n) = 1/(n(n + 1))$, such that $\pi_T(Z = n \mid Z \ge n) = 1/n$.



The algorithm resembles Fixed-Share [Herbster and Warmuth, 1998], but whereas Fixed-Share
implicitly imposes a geometric distribution for $\pi_T$, we allow general priors by varying the shared weight
with $n$. We also add the $\pi_M$ component of the prior, which is crucial for consistency. This addition ensures
that the additional loss compared to the best prediction strategy that switches a finite number of times does
not grow with the sample size.

To ensure finite running time, we need to restrict the switch distribution to switch between a finite
number of prediction strategies. This is no strong restriction though, as we may just take the number of
prediction strategies sufficiently large relative to $N$ when computing $p_{sw}(x^N)$. For example, consider the
switch distribution that switches between prediction strategies $p_1, \ldots, p_{K_{\max}(N)}$. Then all the theorems in
the paper can still be proved if we take $K_{\max}(N)$ sufficiently large (e.g. $K_{\max}(N) \ge g(N)$ would suffice for
the oracle approximation lemma).

This is a special case of a switch distribution that, at sample size $n$, allows switching only to $p_k$ such
that $k \in \mathcal{K}_n \subseteq \mathbb{Z}^+$, where $\mathcal{K}_1 \subseteq \mathcal{K}_2 \subseteq \cdots$. We may view this as a restriction on the prior: $\pi(\mathbb{S} \setminus \mathbb{S}') = 0$,
where

$$\mathbb{S}' := \{ s \in \mathbb{S} \mid \forall n \in \mathbb{Z}^+ : K_n(s) \in \mathcal{K}_n \} \tag{45}$$

denotes the set of allowed parameters, and, as in Section ??,

$$K_n(s) := k_i(s) \text{ for the unique } i \text{ such that } t_i(s) < n \text{ and } \bigl(i = m(s) \vee t_{i+1}(s) \ge n\bigr) \tag{46}$$

denotes which prediction strategy is used to predict outcome $X_n$.

The following online algorithm computes the switch distribution for any $\mathcal{K}_1 \subseteq \mathcal{K}_2 \subseteq \ldots$, provided the
prior is of the form (??). Let the indicator function, $1_A(x)$, be 1 if $x \in A$ and 0 otherwise.






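Algorithm 1 Switch($x^N$)

 1  for $k \in \mathcal{K}_1$ do initialize $w_k^a \leftarrow \pi_K(k) \cdot \theta$; $w_k^b \leftarrow \pi_K(k) \cdot (1 - \theta)$ od
 2  for $n = 1, \ldots, N$ do
 3      Report $p_{sw}(K_n, x^{n-1}) = w^a_{K_n} + w^b_{K_n}$   (a $K$-sized array)
 4      for $k \in \mathcal{K}_n$ do $w_k^a \leftarrow w_k^a \cdot p_k(x_n \mid x^{n-1})$; $w_k^b \leftarrow w_k^b \cdot p_k(x_n \mid x^{n-1})$ od   (loss update)
 5      pool $\leftarrow \pi_T(Z = n \mid Z \ge n) \cdot \sum_{k \in \mathcal{K}_n} w_k^a$   (share update)
 6      for $k \in \mathcal{K}_{n+1}$ do
 7          $w_k^a \leftarrow w_k^a \cdot 1_{\mathcal{K}_n}(k) \cdot \pi_T(Z \ne n \mid Z \ge n) + \text{pool} \cdot \pi_K(k) \cdot \theta$
 8          $w_k^b \leftarrow w_k^b \cdot 1_{\mathcal{K}_n}(k) + \text{pool} \cdot \pi_K(k) \cdot (1 - \theta)$
 9      od
10  od
11  Report $p_{sw}(K_{N+1}, x^N) = w^a_{K_{N+1}} + w^b_{K_{N+1}}$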

This algorithm can be used to obtain fast convergence in the sense of Sections 4 and 5 and consistency in
the sense of Theorem 1. If $\pi_T(Z = n \mid Z \ge n)$ and $\pi_K$ can be computed in constant time, then its running
time is $\Theta(\sum_{n=1}^N |\mathcal{K}_n|)$, which is typically of the same order as that of fast model selection criteria like AIC
and BIC. For example, if the number of considered prediction strategies is fixed at $K_{\max}$ then the running
time is $\Theta(K_{\max} \cdot N)$.
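
The loss and share updates of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions that are ours, not the paper's: the set of predictors is the same finite set at every sample size (so the indicator $1_{\mathcal{K}_n}(k)$ is always 1), $\pi_K$ is passed in already normalised over that finite set, the predictive probabilities $p_k(x_n \mid x^{n-1})$ are supplied as a precomputed table, and the switching probability $\pi_T(Z = n \mid Z \ge n)$ is supplied as a function `hazard`. The names `switch_joint` and `bma_marginal` are hypothetical.

```python
from typing import Callable, List, Sequence


def switch_joint(
    pred_probs: Sequence[Sequence[float]],
    prior_k: Sequence[float],
    hazard: Callable[[int], float],
    theta: float = 0.5,
) -> List[float]:
    """Return p_sw(K_{N+1} = k, x^N) for every predictor k.

    pred_probs[n - 1][k] is p_k(x_n | x^{n-1}); prior_k is pi_K restricted
    to the K predictors and normalised; hazard(n) is pi_T(Z = n | Z >= n).
    """
    # Initialisation: "a" weights may still switch, "b" weights never switch again.
    w_a = [pk * theta for pk in prior_k]
    w_b = [pk * (1.0 - theta) for pk in prior_k]
    for n, probs in enumerate(pred_probs, start=1):
        # Loss update: multiply in each predictor's probability of x_n.
        w_a = [w * p for w, p in zip(w_a, probs)]
        w_b = [w * p for w, p in zip(w_b, probs)]
        # Share update: mass that switches at time n is pooled and redistributed.
        h = hazard(n)
        pool = h * sum(w_a)
        w_a = [w * (1.0 - h) + pool * pk * theta for w, pk in zip(w_a, prior_k)]
        w_b = [w + pool * pk * (1.0 - theta) for w, pk in zip(w_b, prior_k)]
    return [a + b for a, b in zip(w_a, w_b)]


def bma_marginal(pred_probs: Sequence[Sequence[float]], prior_k: Sequence[float]) -> float:
    """Bayesian model averaging marginal likelihood: sum_k pi_K(k) * p_k(x^N)."""
    total = 0.0
    for k, pk in enumerate(prior_k):
        lik = 1.0
        for probs in pred_probs:
            lik *= probs[k]
        total += pk * lik
    return total


if __name__ == "__main__":
    # Toy catch-up scenario: predictor 0 is better early on, predictor 1 later.
    table = [[0.9, 0.1]] * 20 + [[0.1, 0.9]] * 20
    prior = [0.5, 0.5]
    joint = switch_joint(table, prior, hazard=lambda n: 1.0 / n)
    marginal = sum(joint)                      # p_sw(x^N)
    posterior = [j / marginal for j in joint]  # pi(K_{N+1} = k | x^N)
    print("switch marginal:", marginal)
    print("BMA marginal:   ", bma_marginal(table, prior))
    print("posterior:      ", posterior)
```

Summing the returned array gives the marginal likelihood $p_{sw}(x^N)$ and normalising it gives the posterior on $K_{N+1}$. The work per outcome is linear in the number of predictors, in line with the running time stated above. For long sequences a practical implementation would work with log-weights to avoid underflow, which this sketch omits.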

Theorem 14. Let $p_{sw}$ denote the switch distribution with prior $\pi$. Suppose that $\pi$ is of the form (??) and
$\pi(\mathbb{S} \setminus \mathbb{S}') = 0$. Then Algorithm 1 correctly reports $p_{sw}(K_1, x^0), \ldots, p_{sw}(K_{N+1}, x^N)$.

Note that the posterior $\pi(K_{N+1} \mid x^N)$ and the marginal likelihood $p_{sw}(x^N)$ can both be computed from
$p_{sw}(K_{N+1}, x^N)$ in $\Theta(|\mathcal{K}_{N+1}|)$ time. The theorem is proved in Appendix A.7.



7 Relevance and Earlier Work 

Over the last 25 years or so, the question whether to base model selection on AIC or BIC type methods
has received a lot of attention in the theoretical and applied statistics literature, as well as in fields such
as psychology and biology where model selection plays an important role (googling "AIC" and "BIC"
gives 355000 hits) [Speed and Yu, 1993, Hansen and Yu, 2001, 2002, Barron et al., 1994, Forster, 2001,
De Luna and Skouras, 2003, Sober, 2004]. It has even been suggested that, since these two types of methods
have been designed with different goals in mind (optimal prediction vs. "truth hunting"), it may simply be
the case that no procedures exist that combine the best of both types of approaches [Sober, 2004]. Our
Theorem 1, Theorem 4 and our results in Section 5 show that, at least in some cases, one can get the best of
both worlds after all, and model averaging based on $P_{sw}$ achieves the minimax optimal convergence rate. In
typical parametric settings ($P^* \in \mathcal{M}$), model selection based on $P_{sw}$ is consistent, and Lemma 13 suggests
that model averaging based on $P_{sw}$ is within a constant factor of the minimax optimal rate in parametric
settings.



7.1 A Contradiction with Yang's Result? 



Superficially, our results may seem to contradict the central conclusion of Yang [Yang, 2005a]. Yang shows
that there are scenarios in linear regression where no model selection or model combination criterion can be
both consistent and achieve the minimax rate of convergence.
Yang's result is proved for a variation of linear regression in which the estimation error is measured
on the previously observed design points. This setup cannot be directly embedded in our framework. Also,
Yang's notion of model combination is somewhat different from the model averaging that is used to compute
$P_{sw}$. Thus, formally, there is no contradiction between Yang's results and ours. Still, the setups are so similar
that one can easily imagine a variation of Yang's result to hold in our setting as well. Thus, it is useful to
analyze how these "almost" contradictory results may coexist. We suspect (but have no proof) that the
underlying reason is the definition of our minimax convergence rate in Cesàro mean (16), in which $P^*$ is
allowed to depend on $n$, but then the risk with respect to that same $P^*$ is summed over all $i = 1, \ldots, n$. In
contrast, Yang uses the standard definition of convergence rate, without summation. Yang's result holds in a
parametric scenario, where there are two nested parametric models, and data are sampled from a distribution
in one of them. Then both $G_{\text{mm-fix}}$ and $G_{\text{mm-var}}$ are of the same order $\log n$. Even so, it may be possible that
there does exist a minimax optimal procedure that is also consistent, relative to the $G_{\text{mm-fix}}$-game, in which
$P^*$ is kept fixed once $n$ has been determined, while there does not exist a minimax optimal procedure that is
also consistent, relative to the $G_{\text{mm-var}}$-game, in which $P^*$ is allowed to vary. We conjecture that this explains
why Yang's result and ours can coexist: in parametric situations, there exist procedures (such as $P_{sw}$) that
are both consistent and achieve $G_{\text{mm-fix}}$, but there exist no procedures that are both consistent and achieve
$G_{\text{mm-var}}$. We suspect that the qualification "parametric" is essential here: indeed, we conjecture that in the
standard nonparametric case, whenever $P_{sw}$ achieves the fixed-$P^*$ minimax rate $G_{\text{mm-fix}}$, it also achieves the
varying-$P^*$ minimax rate $G_{\text{mm-var}}$. The reason for this conjecture is that, under the standard nonparametric
assumption, whenever $P_{sw}$ achieves $G_{\text{mm-fix}}$, a small modification of $P_{sw}$ will achieve $G_{\text{mm-var}}$. Indeed,
define the Cesàro-switch distribution as

$$P_{\text{Cesàro-sw}}(x_n \mid x^{n-1}) := \frac{1}{n} \sum_{i=1}^{n} P_{sw}(x_n \mid x^{i-1}). \tag{47}$$

Proposition 15. $P_{\text{Cesàro-sw}}$ achieves the varying-$P^*$-minimax rate whenever $P_{sw}$ achieves the fixed-$P^*$-
minimax rate.

The proof of this proposition is similar to the proof of Proposition ?? and can be found in Section A.3.
Since, intuitively, $P_{\text{Cesàro-sw}}$ learns "slower" than $P_{sw}$, we suspect that $P_{sw}$ itself achieves the varying-
$P^*$-minimax rate as well in the standard nonparametric case. However, while in the nonparametric case
$g_{mm}(n) \asymp G_{\text{mm-fix}}(n)/n$, in the parametric case $g_{mm}(n) \asymp 1/n$ whereas $G_{\text{mm-fix}}(n)/n \asymp (\log n)/n$.
Then the reasoning underlying Proposition 15 does not apply anymore, and $P_{\text{Cesàro-sw}}$ may not achieve the
minimax rate for varying $P^*$. Then $P_{sw}$ itself may also not achieve this rate. We suspect that this is not
a coincidence: Yang's result suggests that indeed, in this parametric setting, $P_{sw}$, because it is consistent,
cannot achieve this varying-$P^*$-minimax optimal rate.
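
The Cesàro averaging used to define the Cesàro-switch distribution can be sketched generically: to predict the next outcome, average the probability that a base predictor assigns to it over every prefix of the data seen so far, from the empty prefix up to the full history. The sketch below is our reading of that construction, not code from the paper. `laplace_rule` is a hypothetical stand-in for $P_{sw}$ on binary data, chosen only because it is easy to evaluate by hand.

```python
from typing import Callable, Sequence


def laplace_rule(history: Sequence[int], x: int) -> float:
    """Laplace's rule of succession for binary outcomes (stand-in predictor)."""
    ones = sum(history)
    p_one = (ones + 1) / (len(history) + 2)
    return p_one if x == 1 else 1.0 - p_one


def cesaro_predict(
    predict: Callable[[Sequence[int], int], float],
    history: Sequence[int],
    x: int,
) -> float:
    """Average predict(x | prefix) over all prefixes of history.

    The prefixes range from the empty one to the full history, so n terms
    are averaged when predicting outcome number n = len(history) + 1.
    """
    n = len(history) + 1
    return sum(predict(history[:i], x) for i in range(n)) / n


if __name__ == "__main__":
    past = [1, 1]
    print("base predictor:  ", laplace_rule(past, 1))
    print("Cesaro-averaged: ", cesaro_predict(laplace_rule, past, 1))
```

Because the average includes predictions made from short prefixes, the averaged predictor reacts to the data more slowly than the base predictor, which matches the intuition that it learns "slower".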

7.2 Earlier Approaches to the AIC-BIC Dilemma 

Several other authors have provided procedures which have been designed to behave like AIC whenever
AIC is better, and like BIC whenever BIC is better, and which empirically seem to do so; these include
model meta-selection [De Luna and Skouras, 2003, Clarke, 1997], and Hansen and Yu's gMDL version of
MDL regression [Hansen and Yu, 2001]; also the "mongrel" procedure of [Wong and Clarke, 2004] has
been designed to improve on Bayesian model averaging for small samples. Compared to these other
methods, ours seems to be the first that provably is both consistent and minimax optimal in terms of risk, for
some classes $\mathcal{M}^*$. The only other procedure that we know of for which somewhat related results have
been shown is a version of cross-validation proposed by [Yang, 2005b] to select between AIC and BIC
in regression problems. Yang shows that a particular form of cross-validation will asymptotically select
AIC in case the use of AIC leads to better predictions, and BIC in the case that BIC leads to better
predictions. In contrast to Yang, we use a single paradigm rather than a mix of several ones (such as AIC, BIC
and cross-validation) - essentially our paradigm is just that of universal individual-sequence prediction, or
equivalently, the individual-sequence version of predictive MDL, or equivalently, Dawid's prequential
analysis applied to the log scoring rule. Indeed, our work has been heavily inspired by prequential ideas; in
[Dawid, 1992] it is already suggested that model selection should be based on the transient behaviours in
terms of sequential prediction of the estimators within the models: one should select the model which is
optimal at the given sample size, and this will change over time. Although Dawid uses standard Bayesian
mixtures of parametric models as his running examples, he implicitly suggests that other ways (the details
of which are left unspecified) of combining predictive distributions relative to parametric models may be
preferable, especially in the nonparametric case where the true distribution is outside any of the parametric
models under consideration.

7.3 Prediction with Expert Advice 

Since the switch distribution has been designed to perform well in a setting where the optimal predictor
$p_k$ changes over time, our work is also closely related to the algorithms for tracking the best expert
in the universal prediction literature [Herbster and Warmuth, 1998, Vovk, 1999, Volf and Willems, 1998,
Monteleoni and Jaakkola, 2004]. However, those algorithms are usually intended for data that are
sequentially generated by a mechanism whose behaviour changes over time. In sharp contrast, our switch
distribution is especially suitable for situations where data are sampled from a fixed (though perhaps non-i.i.d.)
source after all; the fact that one model temporarily leads to better predictions than another is caused by the
fact that each "expert" $p_k$ has itself already been designed as a universal predictor/estimator relative to some
large set of distributions $\mathcal{M}_k$. The elements of $\mathcal{M}_k$ may be viewed as "base" predictors/experts, and the $p_k$
may be thought of as meta-experts/predictors. Because of this two-stage structure, which meta-predictor $p_k$
is best changes over time, even though the optimal base-predictor $\arg\min_{p \in \mathcal{M}} r_n(p^*, p)$ does not change
over time.

If one of the considered prediction strategies $p_k$ makes the best predictions eventually, our goal is to
achieve consistent model selection: the total number of switches should also remain bounded. To this end
we have defined the switch distribution such that positive prior probability is associated with switching
finitely often and thereafter using $p_k$ for all further outcomes. We need this property to prove that our
method is consistent. Other dynamic expert tracking algorithms, such as the Fixed-Share algorithm
[Herbster and Warmuth, 1998], have been designed with different goals in mind, and as such they do not
have this property. Not surprisingly then, our results do not resemble any of the existing results in the
"tracking"-literature.

8 The Catch-Up Phenomenon, Bayes and Cross-Validation

8.1 The Catch-Up Phenomenon is Unbelievable! (According to BMA) 

On page 2 we introduced the marginal Bayesian distribution $p_{bma}(x^n) := \sum_k w(k) p_k(x^n)$. If the
distributions $P_k$ are themselves Bayesian marginal distributions as in (1), then $p_{bma}$ may be interpreted as (the
density corresponding to) a distribution on the data that reflects some prior beliefs about the domain that
is being modelled, as represented by the priors $w(k)$ and $w_k(\theta)$. If $w(k)$ and $w_k(\theta)$ truly reflected some
decision-maker's a priori beliefs, then it is clear that the decision-maker would like to make sequential
predictions of $X_{n+1}$ given $X^n = x^n$ based on $p_{bma}$ rather than on $p_{sw}$. Indeed, as we now show, the catch-up
phenomenon as depicted in Figure 1 is exceedingly unlikely to take place under $p_{bma}$, and a priori a
subjective Bayesian should be prepared to bet a lot of money that it does not occur. To see this, consider the
no-hypercompression inequality [Grünwald, 2007], versions of which are also known as "Barron's
inequality" [Barron, 1985] and "competitive optimality of the Shannon-Fano code" [Cover and Thomas, 1991]. It
states that for any two distributions $P$ and $Q$ for $X^\infty$, the $P$-probability that $Q$ outperforms $P$ by $k$ bits or
more when sequentially predicting $X_1, X_2, \ldots$ is exponentially small in $k$: for each $n$,



$$P\bigl(-\log q(X^n) \le -\log p(X^n) - k\bigr) \le 2^{-k}.$$



Plugging in $p_{bma}$ for $p$, and $p_{sw}$ for $q$, we see that what happened in Figure 1 ($p_{sw}$ outperforming $p_{bma}$
by about 40000 bits) is an event with probability no more than $2^{-40000}$ according to $p_{bma}$. Yet, in many
practical situations, the catch-up phenomenon does occur and $p_{sw}$ gains significantly compared to $p_{bma}$.
This can only be possible because either the models are wrong (clearly, The Picture of Dorian Gray has not
been drawn randomly from a finite-order Markov chain), or because the priors are "wrong" in the sense that
they somehow don't match the situation one is trying to model. For this reason, some subjective Bayesians,
when we confronted them with the catch-up phenomenon, have argued that it is just a case of "garbage in,
garbage out" (GIGO): when the phenomenon occurs, then, rather than using the switch distribution, one
should reconsider the model(s) and prior(s) one wants to use, and, once one has found a superior model
$\mathcal{M}'$ and prior $w'$, one should use $p_{bma}$ relative to $\mathcal{M}'$ and $w'$. Of course we agree that if one can come up
with better models, one should use them. Nevertheless, we strongly disagree with the GIGO point
of view: we are convinced that in practice, "correct" priors may be impossible to obtain; similarly, people
are forced to work with "wrong" models all the time. In such cases, rather than embarking on a potentially
never-ending quest for better models, the hurried practitioner may often prefer to use the imperfect - yet still
useful - models that he has available, in the best possible manner. And then it makes sense to use $p_{sw}$ rather
than the Bayesian $p_{bma}$: the best one can hope for in general is to regard the distributions in one's models as
prediction strategies, and try to predict as well as the best strategy contained in any of the models, and $p_{sw}$ is
better at this than $p_{bma}$. Indeed, the catch-up phenomenon raises some interesting questions for Bayes factor
model selection: no matter what the prior is, by the no-hypercompression inequality above with $p = p_{bma}$
and $q = p_{sw}$, when comparing two models $\mathcal{M}_1$ and $\mathcal{M}_2$, before seeing any data, a Bayesian always believes
that the switch distribution will not substantially outperform $p_{bma}$, which implies that a Bayesian cannot
believe that, with non-negligible probability, a complex model $p_2$ can at first predict substantially worse
than a simple model $p_1$ and then, for large samples, can predict substantially better. Yet in practice, this
happens all the time!
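
The no-hypercompression inequality used above can be checked exactly in a small case. The sketch below takes $P$ and $Q$ to be i.i.d. Bernoulli distributions, an assumption made purely so that the probability can be computed by summing over the number of ones; the function name `hypercompression_probability` is ours. It computes the $P$-probability that $Q$ assigns $X^n$ a code length at least $k$ bits shorter than $P$ does, and compares it with the bound $2^{-k}$.

```python
import math


def hypercompression_probability(p_one: float, q_one: float, n: int, k_bits: float) -> float:
    """P-probability that Q's code length for X^n beats P's by at least k_bits.

    P and Q are i.i.d. Bernoulli with success probabilities p_one and q_one.
    """
    total = 0.0
    for ones in range(n + 1):
        log_p = ones * math.log2(p_one) + (n - ones) * math.log2(1.0 - p_one)
        log_q = ones * math.log2(q_one) + (n - ones) * math.log2(1.0 - q_one)
        if -log_q <= -log_p - k_bits:
            # All sequences with this many ones share the same P-probability.
            total += math.comb(n, ones) * 2.0 ** log_p
    return total


if __name__ == "__main__":
    for k in range(1, 6):
        prob = hypercompression_probability(0.5, 0.7, n=20, k_bits=k)
        print(f"k={k}: P-probability={prob:.6f}  bound 2^-k={2.0 ** -k:.6f}")
```

The computed probabilities fall below the bound for every $k$, as the inequality requires. A gain of tens of thousands of bits is therefore an event that $p_{bma}$ itself regards as essentially impossible.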

8.2 Nonparametric Bayes 

A more interesting subjective Bayesian argument against the switch distribution would be that, in the
nonparametric setting, the data are sampled from some $P^* \in \mathcal{M}^* \setminus \mathcal{M}$, and $P^*$ is not contained in any of the
parametric models $\mathcal{M}_1, \mathcal{M}_2, \ldots$. Yet, under the standard hierarchical prior used in $p_{bma}$ (first a discrete prior
on the model index, then a density on the model parameters), we have that with prior-probability 1, $P^*$ is
"parametric", i.e. $P^* \in \mathcal{M}_k$ for some $k$. Thus, our prior distribution is not really suitable for the situation
that we are trying to model in the nonparametric setting, and we should use a nonparametric prior instead.
While we completely agree with this reasoning, we would immediately like to add that the question then
becomes: what nonparametric prior should one use? Nonparametric Bayes has become very popular in recent
years, and it often works surprisingly well. Still, its practical and theoretical performance strongly depends
on the type of priors that are used, and it is often far from clear what prior to use in what situation. In some
situations, some nonparametric priors achieve optimal rates of convergence, but others can even make Bayes
inconsistent [Diaconis and Freedman, 1986, Grünwald, 2007]. The advantage of the switch distribution is
that it does not require any difficult modeling decisions, but nevertheless under reasonable conditions it
achieves the optimal rate of convergence in nonparametric settings, and, in the special case where one of the
models on the list in fact approximates the true source extremely well, this model will in fact be identified
(Theorem 1). In fact, one may think of $p_{sw}$ as specifying a very special kind of nonparametric prior, and
under this interpretation, our results are in complete agreement with the nonparametric Bayesian view.

8.3 Leave-One-Out Cross-Validation 

From the other side of the spectrum, it has sometimes been argued that consistency is irrelevant, since
in practical situations, the true distribution is never in any of the models under consideration. Thus, it is
argued, one should use AIC-type methods such as leave-one-out cross-validation, because of their predictive
optimality. We strongly disagree with this argument, for several reasons: first, in practical model selection
problems, one is often interested in questions such as "does $Y$ depend on feature $X_k$ or not?" For example,
$\mathcal{M}_{k-1}$ is a set of conditional distributions in which $Y$ is independent of $X_k$, and $\mathcal{M}_k$ is a superset thereof
in which $Y$ can be dependent on $X_k$. There are certainly real-life situations where some variable $X_j$ is truly
completely irrelevant for predicting $Y$, and it may be the primary goal of the scientist to find out whether or
not this is the case. In such cases, we would hope our model selection criterion to select, for large $n$, $\mathcal{M}_{k-1}$
rather than $\mathcal{M}_k$, and the problem with the AIC-type methods is that, because of their inconsistency, they
sometimes do not do this. In other words, we think that consistency does matter, and we regard it as a clear
advantage of the switch distribution that it is consistent.

A second advantage over leave-one-out cross-validation is that the switch distribution, like Bayesian
methods, satisfies Dawid's weak prequential principle [Dawid, 1992, Grünwald, 2007]: the switch
distribution assesses the quality of a predictor $p_k$ only in terms of the quality of predictions that were actually made.
To apply LOO on a sample $x_1, \ldots, x_n$, one needs to know the prediction for $x_i$ given $x_1, \ldots, x_{i-1}$, but also
$x_{i+1}, \ldots, x_n$. In practice, these may be hard to compute, unknown or even unknowable. An example of
the first are non-i.i.d. settings such as time series models. An example of the second is the case where the
$p_k$ represent, for example, weather forecasters, or other predictors which have been designed to predict the
future given the past. Actual weather forecasters use computer programs to predict the probability that it
will rain the next day, given a plethora of data about air pressure, humidity, temperature etc. and the pattern
of rain in the past days. It may simply be impossible to apply those programs in a way that they predict the
probability of rain today, given data about tomorrow.
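
The difference in what the two approaches require from a predictor can be made concrete. In the sketch below, Laplace's rule of succession serves as a hypothetical stand-in for a predictor $p_k$ on binary data. It was chosen because it is exchangeable, so both quantities can be computed; for a forecasting system of the kind described above, only the first function would have a counterpart. The function names are ours.

```python
import math
from typing import Sequence


def laplace(ones: int, total: int, x: int) -> float:
    """Laplace's rule: probability of x after seeing `ones` ones in `total` outcomes."""
    p_one = (ones + 1) / (total + 2)
    return p_one if x == 1 else 1.0 - p_one


def prequential_log_loss(xs: Sequence[int]) -> float:
    """Sum of -log2 p(x_i | x_1..x_{i-1}): only past data is ever conditioned on."""
    loss, ones = 0.0, 0
    for i, x in enumerate(xs):
        loss -= math.log2(laplace(ones, i, x))
        ones += x
    return loss


def loo_log_loss(xs: Sequence[int]) -> float:
    """Sum of -log2 p(x_i | all other outcomes): also needs x_{i+1}..x_n."""
    n, ones = len(xs), sum(xs)
    return -sum(math.log2(laplace(ones - x, n - 1, x)) for x in xs)


if __name__ == "__main__":
    data = [1, 1, 0]
    print("prequential loss (bits):", prequential_log_loss(data))
    print("leave-one-out loss (bits):", loo_log_loss(data))
```

The prequential loss uses exactly the predictions that would have been issued as the data arrived, whereas the leave-one-out loss asks the predictor for the probability of each outcome given all the others, including later ones.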

9 Conclusion and Future Work 

We have identified the catch-up phenomenon as the underlying reason for the slow convergence of Bayesian
model selection and averaging. Based on this, we have defined the switch distribution $P_{sw}$, a modification of
the Bayesian marginal distribution which is consistent, but also under broad conditions achieves a minimax
optimal convergence rate, thus resolving the AIC-BIC dilemma.

1. Since $p_{sw}$ can be computed in practice, the approach can readily be tested with real and simulated
data in both density estimation and regression problems. Initial results on simulated data, on which
we will report elsewhere, give empirical evidence that $p_{sw}$ behaves remarkably well in practice. Model
selection based on $p_{sw}$, like for $p_{bma}$, typically identifies the true distribution at moderate sample sizes.
Prediction and estimation based on $P_{sw}$ is of comparable quality to leave-one-out cross-validation
(LOO) and generally, in no experiment did we find that it behaved substantially worse than either
LOO or AIC.

2. It is an interesting open question whether there is an analogue of Lemma 6 and Theorem 4 for model
selection rather than averaging. In other words, in settings such as histogram density estimation where
model averaging based on the switch distribution achieves the minimax convergence rate, does model
selection based on the switch distribution achieve it as well? For example, in Figure 1, sequentially
predicting by the $p_{K_{n+1}}$ that has maximum a posteriori probability (MAP) under the switch
distribution given data $x^n$ is only a few bits worse than predicting by model averaging based on the switch
distribution, and still outperforms standard Bayesian model averaging by about 40 000 bits. In the
experiments mentioned above, we invariably found that predicting by the MAP $p_{K_{n+1}}$ empirically
converges at the same rate as using model averaging, i.e. predicting by $P_{sw}$. However, we have no
proof that this really must always be the case. Analogous results in the MDL literature suggest that
a theorem bounding the risk of switch-based model selection, if it can be proved at all, would bound
the squared Hellinger rather than the KL risk [Grünwald, 2007, Chapter 15].



3. The way we defined $P_{\mathrm{sw}}$, it does not seem suitable for situations in which the number of considered
models or model combinations is exponential in the sample size. Because of condition (i) in Lemma 6,
our theoretical results do not cover this case either. Yet this case is highly important in practice, for
example, in the subset selection problem [Yang, 1999]. It seems clear that the catch-up phenomenon
can and will also occur in model selection problems of that type. Can our methods be adapted to
this situation, while still keeping the computational complexity manageable? And what is the relation
with the popular and computationally efficient $L_1$-approaches to model selection [Tibshirani, 1996]?

A Proofs

A.1 Proof of Theorem 1

Let $U_n = \{s \in \mathbb{S} \mid K_{n+1}(s) \neq k^*\}$ denote the set of "bad" parameters $s$ that select an incorrect model. It is
sufficient to show that

$$\lim_{n\to\infty} \frac{\sum_{s \in U_n} \pi(s)\, q_s(X^n)}{\sum_{s \in \mathbb{S}} \pi(s)\, q_s(X^n)} = 0 \quad \text{with } P_{k^*}\text{-probability 1.} \tag{48}$$

To see this, first note that (48) is almost equivalent to the convergence statement of the theorem. The
difference is that $P_{\theta^*}$-probability has been replaced by $P_{k^*}$-probability. Now suppose the theorem is false.
Then there exists a set of parameters $\Phi \subseteq \Theta_{k^*}$ with $w_{k^*}(\Phi) > 0$ such that the statement of the theorem
does not hold for any $\theta^* \in \Phi$. But then by definition of $P_{k^*}$ we have a contradiction with (48).

To show (48), let $A = \{s \in \mathbb{S} : k_m(s) \neq k^*\}$ denote the set of parameters that are bad for all sufficiently
large $n$. We observe that for each $s' \in U_n$ there exists at least one element $s \in A$ that uses the same
sequence of switch-points and predictors on the first $n + 1$ outcomes (this implies that $K_i(s) = K_i(s')$ for
$i = 1, \ldots, n + 1$) and has no switch-points beyond $n$ (i.e. $t_m(s) \le n$). Consequently, either $s' = s$ or
$s' \in E_s$. Therefore

$$\sum_{s' \in U_n} \pi(s')\, q_{s'}(x^n) \le \sum_{s \in A} \big(\pi(s) + \pi(E_s)\big)\, q_s(x^n) \le (1 + c) \sum_{s \in A} \pi(s)\, q_s(x^n). \tag{49}$$

Defining the mixture $r(x^n) = \sum_{s \in A} \pi(s)\, q_s(x^n)$, we will show that

$$\lim_{n\to\infty} \frac{r(X^n)}{\pi(s = (0, k^*)) \cdot p_{k^*}(X^n)} = 0 \quad \text{with } P_{k^*}\text{-probability 1.} \tag{50}$$

Using (49) and the fact that $\sum_{s \in \mathbb{S}} \pi(s)\, q_s(x^n) \ge \pi(s = (0, k^*)) \cdot p_{k^*}(x^n)$, this implies (48).

For all $s \in A$ and $x^{t_m(s)} \in \mathcal{X}^{t_m(s)}$, by definition $Q_s(X_{t_m+1}^{\infty} \mid x^{t_m})$ equals $P_{k_m}(X_{t_m+1}^{\infty} \mid x^{t_m})$, which
is mutually singular with $P_{k^*}(X_{t_m+1}^{\infty} \mid x^{t_m})$ by assumption. If $\mathcal{X}$ is a separable metric space, which holds
because $\mathcal{X} \subseteq \mathbb{R}^d$ for some $d \in \mathbb{Z}^+$, it can be shown that this conditional mutual singularity implies
mutual singularity of $Q_s(X^\infty)$ and $P_{k^*}(X^\infty)$. To see this for countable $\mathcal{X}$, let $B_{x^{t_m}}$ be any event such that
$Q_s(B_{x^{t_m}} \mid x^{t_m}) = 1$ and $P_{k^*}(B_{x^{t_m}} \mid x^{t_m}) = 0$. Then, for $B = \{y^\infty \in \mathcal{X}^\infty \mid y_{t_m+1}^{\infty} \in B_{y^{t_m}}\}$, we have that
$Q_s(B) = 1$ and $P_{k^*}(B) = 0$. In the uncountable case, however, $B$ may not be measurable. In that case,
the proof follows by Corollary 17 proved in Section A.3. Any countable mixture of distributions that are
mutually singular with $P_{k^*}$, in particular $R$, is mutually singular with $P_{k^*}$. This implies (50) by Lemma 3.1
of [Barron, 1985], which says that for any two mutually singular distributions $R$ and $P$, the density ratio
$r(X^n)/p(X^n)$ goes to zero as $n \to \infty$ with $P$-probability 1. □
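The last step, that the density ratio of two mutually singular distributions vanishes, can be made concrete in the simplest case. The sketch below is only an illustration under assumptions that are not part of the proof: both distributions are i.i.d. Bernoulli (with different parameters, so they are mutually singular on infinite sequences), and the sequence `typical` is a hand-picked sequence with exactly half ones, standing in for a typical sample from the distribution with parameter 0.5. The function name `log_density_ratio` is hypothetical.

```python
import math

def log_density_ratio(xs, r_one, p_one):
    """Return log r(x^n) - log p(x^n) for i.i.d. Bernoulli models with P(1) = r_one and p_one."""
    total = 0.0
    for x in xs:
        total += math.log(r_one if x == 1 else 1.0 - r_one)
        total -= math.log(p_one if x == 1 else 1.0 - p_one)
    return total

# Assumed stand-in for a typical sample under p_one = 0.5: exactly half ones.
typical = [0, 1] * 500
ratios = [log_density_ratio(typical[:n], 0.7, 0.5) for n in (10, 100, 1000)]
```

On such a sequence the log-ratio decreases linearly in $n$, so the ratio itself goes to zero exponentially fast.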

A.2 Proof of Theorem 2

The proof is almost identical to the proof of Theorem 1. Let $U_n = \{s \in \mathbb{S} \mid K_{n+1}(s) \neq k^*\}$ denote the set
of "bad" parameters $s$ that select an incorrect model. It is sufficient to show that

$$\lim_{n\to\infty} \frac{\sum_{s \in U_n} \pi(s)\, q_s(X^n)}{\sum_{s \in \mathbb{S}} \pi(s)\, q_s(X^n)} = 0 \quad \text{with } P^B_{k^*}\text{-probability 1.} \tag{51}$$

Note that the $q_s$ in (51) are defined relative to the non-Bayesian estimators $p_1, p_2, \ldots$, whereas the $P^B_{k^*}$ on
the right of the equation is the probability according to a Bayesian marginal distribution $P^B_{k^*}$, which has been
chosen so that the theorem's condition holds. To see that (51) is sufficient to prove the theorem, suppose the
theorem is false. Then, because the prior $w_{k^*}$ is mutually absolutely continuous with Lebesgue measure,
there exists a set of parameters $\Phi \subseteq \Theta_{k^*}$ with nonzero prior measure under $w_{k^*}$, such that (10) does not
hold for any $\theta^* \in \Phi$. But then by definition of $P^B_{k^*}$ we have a contradiction with (51).

Using exactly the same reasoning as in the proof of Theorem 1, it follows that, analogously to (50), we
have

$$\lim_{n\to\infty} \frac{r(X^n)}{\pi(s = (0, k^*)) \cdot p^B_{k^*}(X^n)} = 0 \quad \text{with } P^B_{k^*}\text{-probability 1.} \tag{52}$$

This is just (50) with $r$ now referring to a mixture of combinator prediction strategies defined relative to the
non-Bayesian estimators $p_1, p_2, \ldots$, and the $p^B_{k^*}$ in the denominator and on the right referring to the Bayesian
marginal distribution $P^B_{k^*}$. Combined with (49), the fact that $\sum_{s \in \mathbb{S}} \pi(s)\, q_s(x^n) \ge \pi(s = (0, k^*)) \cdot p_{k^*}(x^n)$, and
the fact that, by assumption, for some $K$, for all large $n$, $p_{k^*}(X^n) \ge p^B_{k^*}(X^n)\, 2^{-K}$ with $P^B_{k^*}$-probability 1,
this implies (51). □



A.3 Mutual Singularity as Used in the Proof of Theorem 1

Let $Y^2 = (Y_1, Y_2)$ be random variables that take values in separable metric spaces $\Omega_1$ and $\Omega_2$, respectively.
We will assume all spaces to be equipped with Borel $\sigma$-algebras generated by the open sets. Let $p$ be a
prediction strategy for $Y^2$ with corresponding distributions $P(Y_1)$ and, for any $y^1 \in \Omega_1$, $P(Y_2 \mid y^1)$. To
ensure that $P(Y^2)$ is well-defined, we impose the requirement that for any fixed measurable event $A_2 \subseteq \Omega_2$
the probability $P(A_2 \mid y^1)$ is a measurable function of $y^1$.

Lemma 16. Suppose $p$ and $q$ are prediction strategies for $Y^2 = (Y_1, Y_2)$, which take values in separable
metric spaces $\Omega_1$ and $\Omega_2$, respectively. Then if $P(Y_2 \mid y^1)$ and $Q(Y_2 \mid y^1)$ are mutually singular for all $y^1 \in
\Omega_1$, then $P(Y^2)$ and $Q(Y^2)$ are mutually singular.

The proof, due to Peter Harremoës, is given below the following corollary, which is what we are really
interested in. Let $X^\infty = X_1, X_2, \ldots$ be random variables that take values in the separable metric space $\mathcal{X}$.
Then what we need in the proof of Theorem 1 is the following corollary of Lemma 16:

Corollary 17. Suppose $p$ and $q$ are prediction strategies for the sequence of random variables $X^\infty = X_1,
X_2, \ldots$ that take values in respective separable metric spaces $\mathcal{X}_1, \mathcal{X}_2, \ldots$ Let $m$ be any positive integer. Then
if $P(X_{m+1}^{\infty} \mid x^m)$ and $Q(X_{m+1}^{\infty} \mid x^m)$ are mutually singular for all $x^m \in \mathcal{X}^m$, then $P(X^\infty)$ and $Q(X^\infty)$ are
mutually singular.



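Proof. The product spaces $\mathcal{X}_1 \times \cdots \times \mathcal{X}_m$ and $\mathcal{X}_{m+1} \times \mathcal{X}_{m+2} \times \cdots$ are separable metric spaces [Parthasarathy,
1967, pp. 5,6]. Now apply Lemma 16 with $\Omega_1 = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m$ and $\Omega_2 = \mathcal{X}_{m+1} \times \mathcal{X}_{m+2} \times \cdots$. □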



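Proof of Lemma 16. For each $\omega_1 \in \Omega_1$, by mutual singularity of $P(Y_2 \mid \omega_1)$ and $Q(Y_2 \mid \omega_1)$ there exists a
measurable set $C_{\omega_1} \subseteq \Omega_2$ such that $P(C_{\omega_1} \mid \omega_1) = 1$ and $Q(C_{\omega_1} \mid \omega_1) = 0$. As $\Omega_2$ is a metric space, it follows
from [Parthasarathy, 1967, Theorems 1.1 and 1.2 in Chapter II] that for any $\epsilon > 0$ there exists an open set
$U_{\omega_1} \supseteq C_{\omega_1}$ such that

$$P(U_{\omega_1} \mid \omega_1) = 1 \quad \text{and} \quad Q(U_{\omega_1} \mid \omega_1) < \epsilon. \tag{53}$$

As $\Omega_2$ is a separable metric space, there also exists a countable sequence $\{B_i\}_{i \ge 1}$ of open sets such that
every open subset of $\Omega_2$ ($U_{\omega_1}$ in particular) can be expressed as the union of sets from $\{B_i\}$ [Parthasarathy,
1967, Theorem 1.8 in Chapter I].

Let $\{B'_i\}_{i \ge 1}$ denote a subsequence of $\{B_i\}$ such that $U_{\omega_1} = \bigcup_i B'_i$. Suppose $\{B'_i\}$ is a finite sequence.
Then let $V_{\omega_1} = U_{\omega_1}$. Suppose it is not. Then $1 = P(U_{\omega_1} \mid \omega_1) = P(\bigcup_{i=1}^{\infty} B'_i \mid \omega_1) = \lim_{n\to\infty} P(\bigcup_{i=1}^{n} B'_i \mid \omega_1)$,
because $\bigcup_{i=1}^{n} B'_i$ as a function of $n$ is an increasing sequence of sets. Consequently, there exists an $N$ such
that $P(\bigcup_{i=1}^{N} B'_i \mid \omega_1) > 1 - \epsilon$ and we let $V_{\omega_1} = \bigcup_{i=1}^{N} B'_i$. Thus in any case there exists a set $V_{\omega_1} \subseteq U_{\omega_1}$ that
is a union of a finite number of elements in $\{B_i\}$ such that

$$P(V_{\omega_1} \mid \omega_1) > 1 - \epsilon \quad \text{and} \quad Q(V_{\omega_1} \mid \omega_1) < \epsilon. \tag{54}$$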
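
Let $\{D_i\}_{i \ge 1}$ denote an enumeration of all possible unions of a finite number of elements in $\{B_i\}$ and
define the disjoint sequence of sets $\{A^\epsilon_i\}_{i \ge 1}$ by

$$A^\epsilon_i = \{\omega_1 \in \Omega_1 : P(D_i \mid \omega_1) > 1 - \epsilon,\ Q(D_i \mid \omega_1) < \epsilon\} \setminus \bigcup_{j=1}^{i-1} A^\epsilon_j \tag{55}$$

for $i = 1, 2, \ldots$ Note that, by the reasoning above, for each $\omega_1 \in \Omega_1$ there exists an $i$ such that $\omega_1 \in A^\epsilon_i$,
which implies that $\{A^\epsilon_i\}$ forms a partition of $\Omega_1$. Now, as all elements of $\{A^\epsilon_i\}$ and $\{D_i\}$ are measurable,
so is the set $F^\epsilon = \bigcup_{i=1}^{\infty} A^\epsilon_i \times D_i \subseteq \Omega_1 \times \Omega_2$, for which we have that $P(F^\epsilon) = \sum_{i=1}^{\infty} P(A^\epsilon_i \times D_i) >
(1 - \epsilon) \sum_{i=1}^{\infty} P(A^\epsilon_i) = 1 - \epsilon$ and likewise $Q(F^\epsilon) < \epsilon$.

Finally, let $G = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} F^{2^{-k}}$. Then

$$P(G) = \lim_{n\to\infty} P\Big(\bigcup_{k=n}^{\infty} F^{2^{-k}}\Big) \ge \lim_{n\to\infty} 1 - 2^{-n} = 1$$

and

$$Q(G) = \lim_{n\to\infty} Q\Big(\bigcup_{k=n}^{\infty} F^{2^{-k}}\Big) \le \lim_{n\to\infty} \sum_{k=n}^{\infty} 2^{-k} = \lim_{n\to\infty} 2^{-n+1} = 0,$$

which proves the lemma. □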



A.4 Proofs of Section H 
A.4.1 Proof of Lemma 3

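For the first part we underestimate sums:

$$p_{\mathrm{sw}}(x^n) = \sum_{m \in \mathbb{Z}^+} \sum_{s \in \mathbb{S} : m(s) = m} q_s(x^n)\, \pi(s) \ge \pi_M(1) \cdot \sum_{k' \in \mathbb{Z}^+} \pi_K(k')\, p_{k'}(x^n) = \pi_M(1) \cdot p_{\mathrm{bma}}(x^n),$$

$$p_{\mathrm{bma}}(x^n) = \sum_{k' \in \mathbb{Z}^+} p_{k'}(x^n)\, \pi_K(k') \ge \pi_K(k)\, p_k(x^n).$$

We apply (13) to bound the difference in cumulative risk from above:

$$\sum_{i=1}^{n} r_i(P^*, P_{\mathrm{sw}}) = E\left[\log \frac{p^*(X^n)}{p_{\mathrm{sw}}(X^n)}\right] \le E\left[\log \frac{p^*(X^n)}{\pi_M(1)\, p_{\mathrm{bma}}(X^n)}\right] = \sum_{i=1}^{n} r_i(P^*, P_{\mathrm{bma}}) - \log \pi_M(1),$$

$$\sum_{i=1}^{n} r_i(P^*, P_{\mathrm{bma}}) = E\left[\log \frac{p^*(X^n)}{p_{\mathrm{bma}}(X^n)}\right] \le E\left[\log \frac{p^*(X^n)}{\pi_K(k)\, p_k(X^n)}\right] = \sum_{i=1}^{n} r_i(P^*, P_k) - \log \pi_K(k). \qquad \square$$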



A.4.2 Proof of Theorem 4

We will prove a slightly stronger version of the theorem, which shows that the switch distribution in fact
achieves the same multiplicative constant, $A$, as is shown in [Rissanen et al., 1992] for the estimator that
selects $\lceil n^{1/3} \rceil$ bins:

$$\sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^{n} r_i(P^*, P_{\mathrm{sw}}) \preceq A\, n^{1/3}. \tag{56}$$

The idea of the proof is to exhibit an oracle that closely approximates the estimator $P_{\lceil n^{1/3} \rceil}$, but only
switches a logarithmic number of times in $n$ on the first $n$ outcomes, and then apply Lemma 6 to this oracle.

In [Rissanen et al., 1992], Equation (23) is proved from the following theorem, which gives an upper bound
on the risk of any prediction strategy that uses a histogram model with approximately $\lceil n^{1/3} \rceil$ bins to predict
outcome $X_{n+1}$.

Theorem 18. For any $\alpha \ge 1$

$$\max_{\lceil (n/\alpha)^{1/3} \rceil \le k \le \lceil n^{1/3} \rceil}\ \sup_{P^* \in \mathcal{M}^*} r_n(P^*, P_k) \preceq \alpha^{2/3} C\, n^{-2/3}, \tag{57}$$

where $C > 0$ depends only on $c_2$ in (20).

In [Rissanen et al., 1992] the theorem is only proved for $\alpha = 1$, but their proof remains valid for any
$\alpha \ge 1$. From this, (23) follows by summing (57) and approximating $\sum_{i=1}^{n} i^{-2/3}$ by an integral. Summation
is allowed, because $r_i(P^*, P_k)$ is finite for all $P^* \in \mathcal{M}^*$, $i$ and $k$, and $\alpha^{2/3} C \sum_{i=1}^{n} i^{-2/3} \to \infty$ as $n$
goes to infinity. The constant $A$ in (23) is the product of $C$ and the approximation error of this integral
approximation. We will now apply Theorem 18 to prove Theorem 4 as well.
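The integral approximation used here can be checked numerically. The following is a small sketch, not taken from the paper: it compares the partial sum $\sum_{i=1}^{n} i^{-2/3}$ with $\int_0^n x^{-2/3}\,dx = 3 n^{1/3}$. The function name `partial_sum` and the chosen value of `n` are assumptions made for illustration.

```python
def partial_sum(n):
    """Return sum_{i=1}^{n} i^(-2/3), the quantity that is approximated by an integral."""
    return sum(i ** (-2.0 / 3.0) for i in range(1, n + 1))

n = 10 ** 5
# The integral of x^(-2/3) from 0 to n equals 3 * n^(1/3).
ratio = partial_sum(n) / (3 * n ** (1.0 / 3.0))
```

The ratio stays below 1 and approaches 1 as $n$ grows, which is consistent with the cumulative bound being of order $n^{1/3}$.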

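Let $\alpha > 1$ be arbitrary and let $t_j = \lceil \alpha^{j-1} \rceil - 1$ for $j \in \mathbb{Z}^+$ be a sequence of switch-points. For any
$n$, let $j_n$ denote the index of the last preceding switch-point, i.e. $n \in [t_{j_n} + 1, t_{j_n+1}]$. Now define the oracle
$\omega_\alpha(P^*, x^{n-1}) := \lceil (t_{j_n} + 1)^{1/3} \rceil$ for any $P^* \in \mathcal{M}^*$ and any $x^{n-1} \in \mathcal{X}^{n-1}$. If we can apply Lemma 6 to $\omega_\alpha$,
with $f(n) = n^{1/3}$, $g(n) = \lceil n^{1/3} \rceil$, $c_1 = \alpha^{2/3} A$ and $c_2 = 0$, we will obtain

$$\limsup_{n\to\infty} \frac{\sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^{n} r_i(P^*, P_{\mathrm{sw}})}{n^{1/3}} \le \alpha^{2/3} A \tag{58}$$

for any $\alpha > 1$. Theorem 4 then follows, because the left-hand side of this expression does not depend on $\alpha$.
It remains to show that conditions (i)-(iii) of Lemma 6 are satisfied.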

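Condition (i) follows because $t_{j_n} + 1 \le n$. Condition (ii) is implied by the fact that $\omega_\alpha$ has only a logarithmic
number of switch-points: it satisfies $m_{\omega_\alpha}(n) \le \lceil \log_\alpha n \rceil + 2$. Consequently,

$$m_{\omega_\alpha}(n)\,(\log n + \log g(n)) \le (\lceil \log_\alpha n \rceil + 2)(\log n + \log \lceil n^{1/3} \rceil) = o(n^{1/3}). \tag{59}$$

To verify Condition (iii), note that the selected number of bins is close to $\lceil n^{1/3} \rceil$ in the sense of
Theorem 18: for $n \in [t_j + 1, t_{j+1}]$ it follows from $t_{j+1}/(t_j + 1) \le \alpha$ that

$$\lceil (t_j + 1)^{1/3} \rceil = \left\lceil \left( \frac{n}{n/(t_j + 1)} \right)^{1/3} \right\rceil \in \left[ \lceil (n/\alpha)^{1/3} \rceil,\ \lceil n^{1/3} \rceil \right]. \tag{60}$$

We can therefore apply Theorem 18 to obtain

$$\sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^{n} r_i(P^*, P_{\omega_\alpha}) \le \sum_{i=1}^{n} \sup_{P^* \in \mathcal{M}^*} r_i(P^*, P_{\omega_\alpha}) \preceq \alpha^{2/3} C \sum_{i=1}^{n} i^{-2/3} \preceq \alpha^{2/3} A\, n^{1/3}. \tag{61}$$

This shows that Condition (iii) is satisfied and Lemma 6 can be applied to prove the theorem. □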
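The oracle construction can be sketched in a few lines. The code below is an illustrative sketch under assumptions, not the paper's implementation: the function names `ceil_cbrt`, `switch_points` and `oracle_bins` are hypothetical, and the cube root is rounded with integer comparisons to sidestep floating-point error. It computes the switch-points $t_j = \lceil \alpha^{j-1} \rceil - 1$ and the number of bins $\lceil (t_{j_n}+1)^{1/3} \rceil$ used for outcome $n$, so that the bracket in (60) and the switch-count bound behind (59) can be checked for concrete values.

```python
import math

def ceil_cbrt(x):
    """Smallest integer k >= 1 with k**3 >= x, for x > 0 (avoids float rounding issues)."""
    k = max(1, round(x ** (1.0 / 3.0)))
    while k > 1 and (k - 1) ** 3 >= x:
        k -= 1
    while k ** 3 < x:
        k += 1
    return k

def switch_points(alpha, n):
    """All switch-points t_j = ceil(alpha**(j-1)) - 1 with t_j < n, for j = 1, 2, ..."""
    points, j = [], 1
    while math.ceil(alpha ** (j - 1)) - 1 < n:
        points.append(math.ceil(alpha ** (j - 1)) - 1)
        j += 1
    return points

def oracle_bins(alpha, n):
    """Number of bins the oracle uses for outcome n: ceil((t_{j_n} + 1)**(1/3))."""
    return ceil_cbrt(switch_points(alpha, n)[-1] + 1)
```

For example, with $\alpha = 2$ the switch-points below $n = 10$ are $0, 1, 3, 7$, so the oracle uses $\lceil 8^{1/3} \rceil = 2$ bins for outcome 10.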



A.5 Proof of Proposition 7 and Proposition 15

We will actually prove a more general proposition that implies both Proposition 7 and Proposition 15.
Let $P_{\text{mm-fix}}$ be any prediction strategy. Now define the prediction strategy

$$P_{\text{Cesàro}}(x_n \mid x^{n-1}) := \frac{1}{n} \sum_{i=1}^{n} P_{\text{mm-fix}}(x_n \mid x^{i-1}).$$

Thus, $P_{\text{Cesàro}}$ is obtained as a time ("Cesàro"-) average of $P_{\text{mm-fix}}$.

Proposition 19. Suppose that $\mathcal{M}^*$ is standard nonparametric, and that $P_{\text{mm-fix}}$ achieves the minimax rate
in Cesàro mean, i.e. $\sup_{P^* \in \mathcal{M}^*} \sum_{i=1}^{n} r_i(P^*, P_{\text{mm-fix}}) \preceq G_{\text{mm-fix}}(n)$. Then

$$g_{\mathrm{mm}}(n) \le \sup_{P^* \in \mathcal{M}^*} r_n(P^*, P_{\text{Cesàro}}) \preceq n^{-1} G_{\text{mm-fix}}(n) \le n^{-1} G_{\text{mm-var}}(n) \preceq g_{\mathrm{mm}}(n).$$

Proof (of Proposition 19). We show this by extending an argument from [Yang and Barron, 1999, p. 1582].
By applying Jensen's inequality as in Proposition 15.2 of [Grünwald, 2007] (or the corresponding results
in [Yang, 2000] or [Yang and Barron, 1999]) it now follows that, for all $P^* \in \mathcal{M}^*$, $r_n(P^*, P_{\text{Cesàro}}) \le
\frac{1}{n} \sum_{i=1}^{n} r_i(P^*, P_{\text{mm-fix}})$, so that also

$$\sup_{P^*} r_n(P^*, P_{\text{Cesàro}}) \le \sup_{P^*} \frac{1}{n} \sum_{i=1}^{n} r_i(P^*, P_{\text{mm-fix}}). \tag{62}$$

This implies that

$$n\, g_{\mathrm{mm}}(n) \le n \cdot \sup_{P^*} r_n(P^*, P_{\text{Cesàro}}) \preceq G_{\text{mm-fix}}(n) \le G_{\text{mm-var}}(n) = \sum_{i=1}^{n} g_{\mathrm{mm}}(i).$$

Therefore, it suffices to show that for standard nonparametric models, $\sum_{i=1}^{n} g_{\mathrm{mm}}(i) \preceq n\, g_{\mathrm{mm}}(n)$. By (36),
$g_{\mathrm{mm}}(i) \asymp i^{-\gamma} h_0(i)$ for some increasing function $h_0$. Then

$$\sum_{i=1}^{n} g_{\mathrm{mm}}(i) \asymp \sum_{i=1}^{n} i^{-\gamma} h_0(i) \le h_0(n) \sum_{i=1}^{n} i^{-\gamma} \overset{(a)}{\preceq} h_0(n)\, n^{1-\gamma} = n \cdot n^{-\gamma} h_0(n) \asymp n\, g_{\mathrm{mm}}(n), \tag{63}$$

where (a) follows by approximating the sum by an integral. The result follows. □
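The Cesàro-averaged strategy can be sketched as follows. This is a minimal illustration under assumptions, not the paper's code: the base strategy `laplace` (Laplace's rule of succession for binary outcomes) is a hypothetical stand-in for $P_{\text{mm-fix}}$, and `cesaro_predict` is a hypothetical name. The sketch averages the base predictions made from every prefix $x^0, \ldots, x^{n-1}$ of the observed data.

```python
from fractions import Fraction

def laplace(history):
    """Hypothetical base strategy: Laplace's rule of succession for binary outcomes."""
    p_one = Fraction(sum(history) + 1, len(history) + 2)
    return {0: 1 - p_one, 1: p_one}

def cesaro_predict(base, history):
    """Average the base predictions made from each prefix x^0, ..., x^{n-1} of the history."""
    n = len(history) + 1  # we are predicting outcome number n
    avg = {}
    for i in range(n):
        for outcome, prob in base(history[:i]).items():
            avg[outcome] = avg.get(outcome, 0) + prob / n
    return avg

prediction = cesaro_predict(laplace, [1, 1])
```

Here the three prefix predictions for outcome 1 are $1/2$, $2/3$ and $3/4$, so the averaged prediction is $23/36$. Averaging costs a factor $n$ in computation per outcome, which is the price paid for the risk bound (62).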

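A.6 Proof of Lemma 11

Proof. Let $P^* \in \mathcal{M}^*$ be arbitrary. We may transform $\phi_1$ to $\psi_1$, $\phi_2$ to $\psi_2$ and so on, such that for each $k$,
$(\psi_1, \ldots, \psi_k)$ is an orthonormal basis for $S_k$ with respect to $P^*$. For any $k$, each $P \in \mathcal{M}_k$ may now be
parameterized by $\eta = (\eta_{(1)}, \ldots, \eta_{(k)}) \in \mathbb{R}^k$, which means that $P_\eta = P$ expresses $Y_i = \sum_{j=1}^{k} \eta_{(j)} \psi_j(X_i) +
U_i$. Now let $k \in \mathbb{Z}^+$ be arbitrary and define $\tilde\eta$ such that $P_{\tilde\eta} = \tilde{P}_k$. Let $\psi := (\psi_1, \ldots, \psi_k)^T$. Using the fact
that the errors are normally distributed, for any $\eta \in \mathbb{R}^k$, abbreviating $\psi(X)$ to $\psi$, we have

$$\begin{aligned}
D(P^* \| P_\eta) - D(P^* \| P_{\tilde\eta})
&= \frac{1}{2\sigma^2}\, E\, E\big[(Y - \eta^T\psi)^2 - (Y - \tilde\eta^T\psi)^2 \,\big|\, X\big] \\
&= \frac{1}{2\sigma^2}\, E\big[-2 E[Y \mid X]\,(\eta - \tilde\eta)^T\psi + (\eta^T\psi)^2 - (\tilde\eta^T\psi)^2\big] \\
&= \frac{1}{2\sigma^2}\, E\Big[-2 \Big(\sum_{j=1}^{k} \tilde\eta_{(j)}\psi_j + \sum_{j=k+1}^{\infty} \tilde\eta_{(j)}\psi_j\Big)\Big(\sum_{j=1}^{k} (\eta_{(j)} - \tilde\eta_{(j)})\psi_j\Big) + (\eta^T\psi)^2 - (\tilde\eta^T\psi)^2\Big] \\
&= \frac{1}{2\sigma^2}\, E\big[-2\, \tilde\eta^T\psi\, (\eta - \tilde\eta)^T\psi - 2B + (\eta^T\psi)^2 - (\tilde\eta^T\psi)^2\big] \\
&= \frac{1}{2\sigma^2}\, E\big[(\eta^T\psi - \tilde\eta^T\psi)^2\big] = \frac{1}{2\sigma^2}\, E\big[(\eta - \tilde\eta)^T \psi\psi^T (\eta - \tilde\eta)\big] \\
&= \frac{1}{2\sigma^2}\, (\eta - \tilde\eta)^T(\eta - \tilde\eta).
\end{aligned} \tag{64}$$

Here the outer expectation on each line is expectation according to $P^*_X$, the marginal distribution of $X$ under
$P^*$. In the fourth equality, $B = \big(\sum_{j=k+1}^{\infty} \tilde\eta_{(j)}\psi_j\big)\big(\sum_{j=1}^{k} (\eta_{(j)} - \tilde\eta_{(j)})\psi_j\big)$, whose expectation, by orthogonality of the
$\psi_j$, is equal to 0. The final equality also follows by orthogonality.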

Now fix $n > k$, and let $\hat\eta_n$ denote the maximum likelihood parameter value in the $\eta$-parameterization
based on data $X^{n-1}$, i.e. $P_{\hat\eta_n} := P_k(Y_n = \cdot \mid X^n, Y^{n-1})$ (note that $P_k(Y_n = \cdot \mid X^n, Y^{n-1})$ itself does not
depend on the choice of basis). Using (64), we can rewrite (39) as follows:

$$E\big[(\tilde\eta - \hat\eta_{n-1})^T(\tilde\eta - \hat\eta_{n-1})\big] \ge E\big[(\tilde\eta - \hat\eta_n)^T(\tilde\eta - \hat\eta_n)\big], \tag{65}$$

where now the expectation is over $X^{n-1}, Y^{n-1}$, sampled i.i.d. from $P^*$. It thus remains to show that (65)
holds.

Write $\Psi^{(n)}$ for the $n \times k$ design matrix with $(j, i)$-th entry given by $\psi_j(x_i)$. We show further below that,
if $x_1, \ldots, x_{n-1}$ are such that $(\Psi^{(n-1)})^T \Psi^{(n-1)}$ is nonsingular, then the variance of $\hat\eta_{n-1}$ is at least as large
as the variance of $\hat\eta_n$, i.e.:

$$E\big[(\tilde\eta - \hat\eta_{n-1})^T(\tilde\eta - \hat\eta_{n-1}) \,\big|\, X^n = x^n\big] \ge E\big[(\tilde\eta - \hat\eta_n)^T(\tilde\eta - \hat\eta_n) \,\big|\, X^n = x^n\big]. \tag{66}$$

Since, by our assumptions, for all $k$, all $n$,

$$E\big[(\tilde\eta - \hat\eta_n)^T(\tilde\eta - \hat\eta_n) \,\big|\, (\Psi^{(n)})^T \Psi^{(n)} \text{ is singular}\big] < \infty,$$

where, also by assumption, the event that $(\Psi^{(n)})^T \Psi^{(n)}$ is singular has $P^*$-measure 0, it follows that (65) is
implied by (66). Thus, it remains to prove (66). We prove (66) by slightly adjusting an existing geometric
proof of the related (but non-equivalent) Gauss-Markov theorem [Ruud, 1995]. Define, for given $x^n$,

$$P = \Psi^{(n)} \big((\Psi^{(n)})^T \Psi^{(n)}\big)^{-1} (\Psi^{(n)})^T ; \qquad Q = \Psi^{(n)} \big((\Psi^{(n-1)})^T \Psi^{(n-1)}\big)^{-1} (\Psi^{(n-1)})^T J,$$

where $J$ is the $(n - 1) \times n$ matrix with $J_{1,1} = \ldots = J_{n-1,n-1} = 1$, and all other entries equal to 0. Letting
$y = (y_1, \ldots, y_n)^T$, we see that $P$ is a projection matrix, and

$$Py = \Psi^{(n)} \hat\eta_n ; \qquad Qy = \Psi^{(n)} \hat\eta_{n-1}. \tag{67}$$

Now, for arbitrary $a \in \mathbb{R}^n$, we have

$$\mathrm{var}(a^T Q y \mid x^n) = \mathrm{var}(a^T P y \mid x^n) + \mathrm{var}(a^T (Q - P) y \mid x^n) + 2\,\mathrm{cov}(a^T (Q - P) y,\ a^T P y \mid x^n). \tag{68}$$

A straightforward (but tedious) calculation shows that

$$\mathrm{cov}(a^T (Q - P) y,\ a^T P y \mid x^n) = \sigma^2 a^T (Q P^T - P P^T) a.$$

As $P$ is symmetric, $P^T = P$, and for all $y \in \mathbb{R}^n$, $\hat{y} := Py$ is in the column space of $\Psi^{(n)}$, so that $P\hat{y} = \hat{y}$,
and $P P^T y = \hat{y}$. But since $Q \Psi^{(n)} = \Psi^{(n)}$, and $\hat{y}$ is in the column space of $\Psi^{(n)}$, we must also have $Q\hat{y} = \hat{y}$
and $Q P^T y = \hat{y}$. Thus, for arbitrary $y$, $Q P^T y = P P^T y$, and it follows that the cov-term in (68) is equal to
0. Thus, (68) implies that

$$\mathrm{var}(a^T Q y \mid x^n) \ge \mathrm{var}(a^T P y \mid x^n). \tag{69}$$

Now apply this with

$$a^T := (1, 1, \ldots, 1) \cdot \big((\Psi^{(n)})^T \Psi^{(n)}\big)^{-1} (\Psi^{(n)})^T,$$

where the leftmost vector is a $k$-dimensional vector of 1s. By (67), (69) now becomes equivalent to
$\mathrm{var} \sum_{j=1}^{k} \hat\eta_{n-1,j} \ge \mathrm{var} \sum_{j=1}^{k} \hat\eta_{n,j}$, which is just (66). □
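The variance comparison can be checked on a small example. The sketch below is an illustration under assumptions that go beyond the proof: it takes the well-specified linear model with noise variance $\sigma^2$, for which the least-squares estimator has conditional covariance $\sigma^2 (\Psi^T \Psi)^{-1}$, and uses a hypothetical polynomial design. All function names are invented for this example. It compares the summed coefficient variance based on $n - 1$ design points with that based on $n$ points, using exact rational arithmetic.

```python
from fractions import Fraction

def gram(rows):
    """Return Psi^T Psi for a design matrix given as a list of rows."""
    k = len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]

def inverse(m):
    """Invert a nonsingular square matrix by Gauss-Jordan elimination (exact with Fractions)."""
    k = len(m)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(k)] for i, row in enumerate(m)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = 1 / aug[col][col]
        aug[col] = [v * scale for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def total_variance(rows, sigma2=Fraction(1)):
    """Trace of sigma^2 (Psi^T Psi)^{-1}: the summed variance of the least-squares coefficients."""
    inv = inverse(gram(rows))
    return sigma2 * sum(inv[i][i] for i in range(len(inv)))

# Hypothetical design: basis functions (1, x, x^2) evaluated at x = 0, 1, 2, 3, 4.
design = [[Fraction(x) ** j for j in range(3)] for x in range(5)]
var_n_minus_1 = total_variance(design[:-1])  # estimator based on the first n - 1 points
var_n = total_variance(design)               # estimator based on all n points
```

Adding a design point can only increase $\Psi^T \Psi$ in the positive semidefinite order, so the summed variance cannot increase, in line with (66).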

A.7 Proof of Theorem 14

Before we prove Theorem 14 we need to establish some additional properties of the prior $\pi$ as defined
in the main text. To this end, let us define the random variables

$$S_n(s) := \mathbf{1}_{\{n-1 \in \{t_1, \ldots, t_m\}\}}(s), \tag{70}$$

$$M_n(s) := \mathbf{1}_{\{n > t_m\}}(s) \tag{71}$$

for all $n \in \mathbb{Z}^+$ and $s = ((t_1, k_1), \ldots, (t_m, k_m)) \in \mathbb{S}$. These functions denote, respectively, whether or not
a switch occurs between outcome $X_{n-1}$ and outcome $X_n$, and whether or not the last switch occurs somewhere
before outcome $n$. We also define $\xi_n(s) := (S_n(s), M_n(s), K_n(s))$ as a convenient abbreviation.

Every parameter value $s \in \mathbb{S}$ determines an infinite sequence of values $\xi_1, \xi_2, \ldots$, and vice versa.
The advantage of these new variables is that they allow us to interpret the prior as a sequential strategy
for prediction of the value of the next random variable $\xi_{n+1}$ (which in turn determines the distribution
on $X_{n+1}$ given $x^n$), given all previous random variables $\xi^n := (\xi_1, \ldots, \xi_n)$. In fact, we will show that
$p_{\mathrm{sw}}(\xi_{n+1} \mid x^n, \xi^n) = \pi(\xi_{n+1} \mid \xi^n)$. We therefore first calculate the conditional probability $\pi(\xi_{n+1} \mid \xi^n)$
before proceeding to prove the theorem. As it turns out, our prior has the nice property that $\pi(\xi_{n+1} \mid \xi^n) =
\pi(\xi_{n+1} \mid M_n, K_n)$, which is the reason for the efficiency of the algorithm.

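Lemma 20. Let $\pi(s) = \theta^{m-1}(1 - \theta)\, \pi_K(k_1) \prod_{i=2}^{m} \pi_T(t_i \mid t_i > t_{i-1})\, \pi_K(k_i)$ as in the definition of the prior. Then

$$\pi(\xi_1) = \begin{cases} \pi_K(K_1)\, \theta & \text{if } M_1 = 0, \\ \pi_K(K_1)(1 - \theta) & \text{if } M_1 = 1. \end{cases} \tag{72}$$

And for $n \ge 1$

$$\pi(\xi_{n+1} \mid \xi^n) = \pi(\xi_{n+1} \mid M_n, K_n), \text{ and} \tag{73}$$

$$\pi(\xi_{n+1} = (s_{n+1}, m_{n+1}, k_{n+1}) \mid M_n = m_n, K_n = k_n) \tag{74}$$

$$= \begin{cases}
\pi_T(T > n \mid T \ge n) & \text{if } s_{n+1} = 0,\ m_{n+1} = m_n = 0,\ k_{n+1} = k_n, \\
1 & \text{if } s_{n+1} = 0,\ m_{n+1} = m_n = 1,\ k_{n+1} = k_n, \\
\pi_T(T = n \mid T \ge n)\, \pi_K(k_{n+1})\, \theta & \text{if } s_{n+1} = 1,\ m_{n+1} = m_n = 0, \\
\pi_T(T = n \mid T \ge n)\, \pi_K(k_{n+1})(1 - \theta) & \text{if } s_{n+1} = 1,\ m_{n+1} = 1,\ m_n = 0, \\
0 & \text{otherwise.}
\end{cases} \tag{75}$$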

Proof. To check (72), note that we must have either $\xi_1 = (1, 1, k)$ for some $k \in \mathbb{Z}^+$, which corresponds to
$s = ((0, k))$ which has probability $\pi_K(k)(1 - \theta)$ as required, or $\xi_1 = (1, 0, k)$. The latter corresponds to the
event that $m > 1$ and $K_1 = k$, which has probability $\pi_K(k)\, \theta$.

We proceed to calculate the conditional probability $\pi(\xi_{n+1} \mid \xi^n)$ for $n \ge 1$. First suppose $M_n(s) = 0$.
Let $A_n(s) := \max\{i \mid t_i < n\} = \sum_{i=1}^{n} S_i$ count the number of switches before $n$. Also note that $\xi^n$ and
$M_n = 0$ determine $t_1, \ldots, t_{A_n}$, $k_1, \ldots, k_{A_n}$, that $t_{A_n+1} \ge n$ and $m(s) > A_n$, and vice versa. Hence for any $n$

$$\pi(\xi^n \text{ such that } M_n = 0) = \pi_M(m > A_n)\, \pi(t_1, \ldots, t_{A_n}, n \le t_{A_n+1}, k_1, \ldots, k_{A_n} \mid t_1 < \ldots < t_{A_n+1}, m > A_n). \tag{76}$$

Likewise, for $M_n = 1$

$$\pi(\xi^n \text{ such that } M_n = 1) = \pi_M(m = A_n)\, \pi(t_1, \ldots, t_{A_n}, k_1, \ldots, k_{A_n} \mid t_1 < \ldots < t_{A_n}, m = A_n). \tag{77}$$

From (76) and (77) we can compute the conditional probability $\pi(\xi_{n+1} \mid \xi^n)$. We distinguish further on
the basis of the possible values of $S_{n+1}$ and $M_{n+1}$. Note that $M_{n+1} = 0$ implies $M_n = 0$ and $M_{n+1} = 1$
implies $M_n = 1 - S_{n+1}$. Also note that $S_{n+1} = 0$ implies $A_{n+1} = A_n$ and $K_n = K_{n+1}$, and that $S_{n+1} = 1$
implies $A_{n+1} = A_n + 1$ and $t_{A_n+1} = n$. Conveniently, most factors cancel out, and we obtain

$$\pi(S_{n+1} = 0, M_{n+1} = 0, K_{n+1} = k \mid \xi^n \text{ s.t. } M_n = 0, K_n = k) = \pi(t_{A_n+1} \ge n + 1 \mid t_{A_n+1} \ge n) = \pi_T(T > n \mid T \ge n), \tag{78}$$

$$\pi(S_{n+1} = 0, M_{n+1} = 1, K_{n+1} = k \mid \xi^n \text{ s.t. } M_n = 1, K_n = k) = 1, \tag{79}$$

$$\begin{aligned}
\pi(S_{n+1} = 1, M_{n+1} = 0, K_{n+1} = k \mid \xi^n \text{ s.t. } M_n = 0)
&= \pi_M(m > A_n + 1 \mid m > A_n)\, \pi(t_{A_n+1} = n \mid t_{A_n+1} \ge n)\, \pi_K(k) \\
&= \theta\, \pi_T(T = n \mid T \ge n)\, \pi_K(k),
\end{aligned} \tag{80}$$

$$\begin{aligned}
\pi(S_{n+1} = 1, M_{n+1} = 1, K_{n+1} = k \mid \xi^n \text{ s.t. } M_n = 0)
&= \pi_M(m = A_n + 1 \mid m > A_n)\, \pi(t_{A_n+1} = n \mid t_{A_n+1} \ge n)\, \pi_K(k) \\
&= (1 - \theta)\, \pi_T(T = n \mid T \ge n)\, \pi_K(k).
\end{aligned} \tag{81}$$

The observation that these conditional probabilities depend only on $M_n$ and $K_n$ shows that $\pi(\xi_{n+1} \mid \xi^n) =
\pi(\xi_{n+1} \mid M_n, K_n)$, which completes the proof of the lemma. □
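The transition probabilities of Lemma 20 can be written down directly. The sketch below is an illustration under assumptions, not the paper's code: the function name `transition_probs`, the dictionary representation of $\pi_K$, and the hazard function (which stands for $\pi_T(T = n \mid T \ge n)$ and is set to $1/(n+1)$ in the usage example) are all hypothetical choices. It returns the nonzero cases of (75) and lets one confirm that they form a probability distribution over the next state.

```python
from fractions import Fraction

def transition_probs(n, m_n, k_n, prior_k, theta, hazard):
    """Return {(s, m, k): probability of xi_{n+1} = (s, m, k) given M_n = m_n, K_n = k_n}."""
    if m_n == 1:
        # The last switch has already occurred: stay with the current predictor forever.
        return {(0, 1, k_n): Fraction(1)}
    h = hazard(n)  # assumed to equal pi_T(T = n | T >= n)
    probs = {(0, 0, k_n): 1 - h}  # no switch at n, more switches still to come
    for k, pk in prior_k.items():
        probs[(1, 0, k)] = h * pk * theta        # switch to k, not the last switch
        probs[(1, 1, k)] = h * pk * (1 - theta)  # switch to k, this is the last switch
    return probs

# Hypothetical example: two models with a uniform prior, theta = 1/3, hazard 1/(n+1).
prior = {1: Fraction(1, 2), 2: Fraction(1, 2)}
example = transition_probs(2, 0, 1, prior, Fraction(1, 3), lambda n: Fraction(1, n + 1))
```

Because the result depends on the past only through $(M_n, K_n)$, a sequential algorithm needs to track only two weights per model.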
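
Proof of Theorem 14. Note that $\xi^n(s)$ completely determines $q_s(X^n)$. Therefore let $q_{\xi^n}(X^n) := q_{\xi^n(s)}(X^n)
= q_s(X^n)$. It follows that

$$\begin{aligned}
p_{\mathrm{sw}}(\xi^{n+1} = e^{n+1}, x^n) &= \sum_{s : \xi^{n+1}(s) = e^{n+1}} \pi(s)\, q_s(x^n) && (82) \\
&= q_{e^n}(x^n)\, \pi(\xi^{n+1} = e^{n+1}) && (83) \\
&= q_{e^n}(x^n)\, \pi(\xi^n = e^n)\, \pi(\xi_{n+1} = e_{n+1} \mid \xi^n = e^n) && (84) \\
&= p_{\mathrm{sw}}(\xi^n = e^n, x^n)\, \pi(\xi_{n+1} = e_{n+1} \mid \xi^n = e^n), && (85)
\end{aligned}$$

which together with Lemma 20 implies that

$$p_{\mathrm{sw}}(\xi_{n+1} \mid \xi^n, x^n) = \pi(\xi_{n+1} \mid \xi^n) = \pi(\xi_{n+1} \mid M_n, K_n). \tag{86}$$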

We will now go through the algorithm step by step to show that the invariants $w^a_k = p_{\mathrm{sw}}(x^{n-1}, M_n =
0, K_n = k)$ and $w^b_k = p_{\mathrm{sw}}(x^{n-1}, M_n = 1, K_n = k)$ hold for all $k \in \mathcal{K}_n$ at the start of each iteration through
the loop (before line 3). These invariants ensure that $w^a_k + w^b_k = p_{\mathrm{sw}}(x^{n-1}, K_n = k)$ so that the correct
probabilities are reported.

Line 1 initializes $w^a_k$ to $\pi_K(k)\, \theta = \pi(S_1 = 1, M_1 = 0, K_1 = k) = p_{\mathrm{sw}}(x^0, M_1 = 0, K_1 = k)$ for
$k \in \mathcal{K}_1$. Likewise $w^b_k = \pi_K(k)(1 - \theta) = \pi(S_1 = 1, M_1 = 1, K_1 = k) = p_{\mathrm{sw}}(x^0, M_1 = 1, K_1 = k)$. Thus
the loop invariant holds at the start of the first iteration.

We proceed to show that the invariant holds in subsequent iterations as well. In the loss update in line 4
we update the weights for $k \in \mathcal{K}_n$ to

$$\begin{aligned}
w^a_k &= p_{\mathrm{sw}}(x^{n-1}, M_n = 0, K_n = k) \cdot p_k(x_n \mid x^{n-1}) \\
&= \sum_{s : M_n = 0, K_n = k} \pi(s) \Big(\prod_{i=1}^{n-1} p_{K_i}(x_i \mid x^{i-1})\Big)\, p_{K_n}(x_n \mid x^{n-1}) = p_{\mathrm{sw}}(x^n, M_n = 0, K_n = k).
\end{aligned}$$

Similarly $w^b_k = p_{\mathrm{sw}}(x^n, M_n = 1, K_n = k)$. Then in line 5 we compute $\mathrm{pool} = \pi_T(Z = n \mid Z \ge
n) \sum_{k \in \mathcal{K}_n} p_{\mathrm{sw}}(x^n, M_n = 0, K_n = k) = \pi_T(Z = n \mid Z \ge n)\, p_{\mathrm{sw}}(x^n, M_n = 0)$. Finally, we consider the
loop that starts at line 6 and ends at line 9. First note that for $k \in \mathcal{K}_n$ by applying Lemma 20 and (86) we
obtain

$$\begin{aligned}
w^a_k\, \pi_T(Z > n \mid Z \ge n)
&= p_{\mathrm{sw}}(x^n, M_n = 0, K_n = k)\, \pi_T(Z > n \mid Z \ge n) \\
&= p_{\mathrm{sw}}(x^n, M_n = 0, K_n = k)\, p_{\mathrm{sw}}(S_{n+1} = 0, M_{n+1} = 0, K_{n+1} = k \mid x^n, M_n = 0, K_n = k) \\
&= p_{\mathrm{sw}}(x^n, M_n = 0, K_{n+1} = K_n = k, S_{n+1} = 0, M_{n+1} = 0) \\
&= p_{\mathrm{sw}}(x^n, S_{n+1} = 0, M_{n+1} = 0, K_{n+1} = k).
\end{aligned} \tag{87}$$

Similarly we get for $k \in \mathcal{K}_n$ that

$$w^b_k = p_{\mathrm{sw}}(x^n, M_n = 1, K_n = k) = p_{\mathrm{sw}}(x^n, S_{n+1} = 0, M_{n+1} = 1, K_{n+1} = k). \tag{88}$$

As $S_{n+1} = 0$ implies $K_{n+1} = K_n$, we have for $k \in \mathcal{K}_{n+1} \setminus \mathcal{K}_n$ that

$$p_{\mathrm{sw}}(x^n, S_{n+1} = 0, M_{n+1} = 0, K_{n+1} = k) = 0, \tag{89}$$

$$p_{\mathrm{sw}}(x^n, S_{n+1} = 0, M_{n+1} = 1, K_{n+1} = k) = 0. \tag{90}$$

By Lemma 20 and (86) we also get that

$$\begin{aligned}
\mathrm{pool}\, \pi_K(k)\, \theta &= \pi_T(Z = n \mid Z \ge n)\, p_{\mathrm{sw}}(x^n, M_n = 0)\, \pi_K(k)\, \theta \\
&= p_{\mathrm{sw}}(x^n, M_n = 0)\, p_{\mathrm{sw}}(S_{n+1} = 1, M_{n+1} = 0, K_{n+1} = k \mid x^n, M_n = 0) \\
&= p_{\mathrm{sw}}(x^n, M_n = 0, S_{n+1} = 1, M_{n+1} = 0, K_{n+1} = k) \\
&= p_{\mathrm{sw}}(x^n, S_{n+1} = 1, M_{n+1} = 0, K_{n+1} = k),
\end{aligned} \tag{91}$$

and similarly

$$\begin{aligned}
\mathrm{pool}\, \pi_K(k)(1 - \theta) &= \pi_T(Z = n \mid Z \ge n)\, p_{\mathrm{sw}}(x^n, M_n = 0)\, \pi_K(k)(1 - \theta) \\
&= p_{\mathrm{sw}}(x^n, S_{n+1} = 1, M_{n+1} = 1, K_{n+1} = k).
\end{aligned} \tag{92}$$

Together, (87)-(92) imply that at the end of the loop

$$\begin{aligned}
w^a_k &= p_{\mathrm{sw}}(x^n, S_{n+1} = 0, M_{n+1} = 0, K_{n+1} = k) + p_{\mathrm{sw}}(x^n, S_{n+1} = 1, M_{n+1} = 0, K_{n+1} = k) \\
&= p_{\mathrm{sw}}(x^n, M_{n+1} = 0, K_{n+1} = k), \\
w^b_k &= p_{\mathrm{sw}}(x^n, S_{n+1} = 0, M_{n+1} = 1, K_{n+1} = k) + p_{\mathrm{sw}}(x^n, S_{n+1} = 1, M_{n+1} = 1, K_{n+1} = k) \\
&= p_{\mathrm{sw}}(x^n, M_{n+1} = 1, K_{n+1} = k),
\end{aligned}$$

which shows that the loop invariants hold at the start of the next iteration and that after the last iteration the
final posterior is also correctly reported based on these weights. □
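The update steps verified in this proof can be assembled into a short program. The sketch below is reconstructed from the invariants above and should be read as an illustration under assumptions rather than as the paper's algorithm listing: it assumes a fixed finite set of models, takes the one-step predictive probabilities as input, and uses the hazard $\pi_T(Z = n \mid Z \ge n) = 1/(n+1)$ as a default. The names `switch_marginal`, `preds`, `wa` and `wb` are hypothetical.

```python
def switch_marginal(preds, prior_k, theta=0.5, hazard=lambda n: 1.0 / (n + 1)):
    """Sketch of the switch-distribution recursion.

    preds[n-1][k] is assumed to hold p_k(x_n | x^{n-1}) for n = 1..N.
    Returns (p_sw(x^N), posterior over K_{N+1}).
    """
    num_models = len(prior_k)
    # Initialisation: wa tracks M = 0 (more switches to come), wb tracks M = 1 (last switch done).
    wa = [theta * prior_k[k] for k in range(num_models)]
    wb = [(1.0 - theta) * prior_k[k] for k in range(num_models)]
    for n, p in enumerate(preds, start=1):
        # Loss update: multiply in the predictive probability of the observed outcome.
        wa = [wa[k] * p[k] for k in range(num_models)]
        wb = [wb[k] * p[k] for k in range(num_models)]
        # Share update: a fraction h of the M = 0 mass switches at time n.
        h = hazard(n)
        pool = h * sum(wa)
        wa = [wa[k] * (1.0 - h) + theta * prior_k[k] * pool for k in range(num_models)]
        wb = [wb[k] + (1.0 - theta) * prior_k[k] * pool for k in range(num_models)]
    total = sum(wa) + sum(wb)
    posterior = [(wa[k] + wb[k]) / total for k in range(num_models)]
    return total, posterior
```

Each outcome costs time linear in the number of models, since the share update only needs the pooled mass. The share update also leaves the total weight unchanged, which is why the sum of all weights equals the marginal probability of the data.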